
arxiv: 2606.02742 · v2 · pith:MEQH25FGnew · submitted 2026-06-01 · 💻 cs.CV

Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models

Pith reviewed 2026-06-30 10:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords: vision-language models, spatial reasoning, metric distances, view consistency, evidence sensitivity, multi-view evaluation, ViewDiag

The pith

Vision-language models often give the same wrong answer to spatial distance questions from different viewpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether consistent answers from VLMs across different views indicate real geometric understanding. Instead, it finds that models frequently produce the same incorrect metric distance regardless of the viewpoint, showing their outputs are not tightly linked to the specific visual details in each image. This is demonstrated using ViewDiag, a new protocol that tracks object pairs in multiple views from real 3D datasets and checks accuracy, answer concentration, and internal model collapse. The results place many models in a region of high consistency but low accuracy, suggesting reliance on learned priors over evidence from the current view. If true, this means cross-view agreement is not a reliable sign of spatial reasoning ability.

Core claim

Leading VLMs produce view-invariant and consistent answers on metric distance queries even when those answers are incorrect. This pattern of high prediction stability paired with substantial error indicates weak coupling between predictions and viewpoint-specific visual evidence, with stable outputs reflecting prior-driven collapse rather than evidence-sensitive reasoning.

What carries the argument

ViewDiag, a multi-view evaluation protocol that uses object-pair tracks across 2-10 views from Hypersim, ScanNet, and KITTI360 and measures metric accuracy, distributional concentration, and internal collapse (the last via a latent feature probe).
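
As a rough illustration of how per-track stability and error can be read together, the sketch below summarizes one object-pair track and flags it when predictions barely move across views yet sit far from the ground truth. The metric definitions (mean absolute relative error, coefficient of variation), the thresholds, and the function names are assumptions made for illustration; the paper's exact definitions are not given in this summary.

```python
from statistics import mean, pstdev

def track_summary(preds, gts):
    """Summarize one object-pair track.

    preds: predicted distances in metres, one per view.
    gts:   ground-truth distances in metres, one per view (assumed positive).
    """
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equally long")
    # Accuracy axis: mean absolute relative error over the views of the track.
    abs_rel = mean(abs(p - g) / g for p, g in zip(preds, gts))
    # Stability axis: coefficient of variation of the predictions across views.
    m = mean(preds)
    cv = pstdev(preds) / m if m > 0 else float("inf")
    return {"abs_rel": abs_rel, "cv": cv}

def is_consistent_yet_wrong(summary, max_cv=0.1, min_abs_rel=0.5):
    """Hypothetical 'collapse quadrant' test: low spread, high error."""
    return summary["cv"] <= max_cv and summary["abs_rel"] >= min_abs_rel
```

A track with identical predictions in every view has a coefficient of variation of zero, so it is flagged whenever its relative error exceeds the chosen threshold.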

If this is right

  • Cross-view consistency cannot serve as a proxy for geometric understanding in spatial VLMs.
  • Stable predictions across views likely stem from prior-driven collapse instead of visual evidence.
  • VLMs require evaluation on evidence coupling in addition to raw accuracy for spatial tasks.
  • ViewDiag offers a diagnostic tool to check if models are grounded in viewpoint-specific input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pattern holds, methods that enforce view-invariance during training may increase evidence insensitivity.
  • This finding suggests current VLMs may underperform in dynamic or embodied settings where viewpoint changes alter correct answers.
  • Similar evidence insensitivity could appear in other consistency-based evaluations, such as in video or 3D reconstruction tasks.

Load-bearing premise

The metric distances computed from 3D reconstructions match the visual information present in the 2D images fed to the models.

What would settle it

A model that changes its distance predictions appropriately when the viewpoint alters the true distance, while maintaining overall accuracy, would contradict the evidence insensitivity claim.
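
One way to make this test concrete is to regress predictions on ground truth over a set of queries: a slope near 1 suggests outputs track the evidence, while a slope near 0 suggests the same answer is returned regardless of input. The sketch below is a minimal ordinary-least-squares version; the function name and the choice of a pooled linear fit are assumptions, not the paper's procedure.

```python
def evidence_slope(gts, preds):
    """Least-squares slope of predicted distance against ground-truth distance.

    gts, preds: equally long sequences of distances in metres.
    Returns a value near 1.0 for evidence-tracking outputs and near 0.0
    for outputs that do not vary with the ground truth.
    """
    if len(gts) != len(preds) or len(gts) < 2:
        raise ValueError("need at least two paired values")
    n = len(gts)
    mean_g = sum(gts) / n
    mean_p = sum(preds) / n
    sxx = sum((g - mean_g) ** 2 for g in gts)
    if sxx == 0:
        raise ValueError("ground truth does not vary; slope is undefined")
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(gts, preds))
    return sxy / sxx
```

A slope near 1 alone does not establish accuracy, since a constant offset leaves it unchanged, so it would need to be read alongside an error metric.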

Figures

Figures reproduced from arXiv: 2606.02742 by S Divakar Bhat, Toshihiko Yamasaki.

Figure 1: View consistency can mislead: geometry baselines such as ZoeDepth [2] track visual evidence and achieve lower error, while typical and spatial VLMs are often consistent yet wrong. [Left] The collapse quadrant (Consistency vs. Accuracy) highlights models that are highly consistent but inaccurate. [Centre] Same-pair traces show predictions that remain fixed across views despite changing evidence. [Right] A b…
Figure 2: ViewDiag samples with annotated regions ([R0]/[R1]): …
Figure 3: Controlled multi-view protocol. A fixed region pair is …
Figure 4: Prediction histograms (log bins) for SpatialRGPT, DepthLM, and ZoeDepth. Geometry baselines exhibit broader output distri…
Figure 5: Residual vs GT (log bin count) for SpatialRGPT, DepthLM, and ZoeDepth, contrasting spatial VLMs and geometry-based …
Figure 6: Internal collapse scatters for Qwen2-VL-7B, SpatialRGPT, and DepthLM ( …
Figure 7: Top-25 mode profiles for SpatialRGPT, DepthLM, and ZoeDepth, highlighting output concentration in spatial VLMs.
Figure 8: Targeted evidence traces: a VLM collapse case, a …
Original abstract

Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence. We introduce ViewDiag, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2-10 views per track. The protocol evaluates models along three axes: metric accuracy, distributional concentration, and internal collapse, the last of which is assessed using a latent feature probe. Across diverse models, we observe a consistent pattern of high prediction stability paired with substantial error, clustering in a regime characterized by strong consistency but low accuracy. These results challenge the common use of cross-view consistency as a proxy for geometric understanding. Instead, we show that stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning. ViewDiag provides a controlled benchmark and diagnostic framework for evaluating whether spatial VLMs are not only accurate, but also meaningfully coupled to visual evidence.
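
The "distributional concentration" axis can be pictured as the share of all predictions that falls on a model's few most frequent answers. The sketch below is one plausible reading, not the paper's definition: the rounding precision, the default of 25 modes (borrowed from the Figure 7 caption), and the function name are assumptions.

```python
from collections import Counter

def top_k_mass(preds, k=25, ndigits=1):
    """Fraction of predictions covered by the k most frequent values.

    preds:   predicted distances in metres, pooled over all queries.
    ndigits: rounding applied before counting, so 2.04 and 1.96 share a bin.
    A value near 1.0 means the outputs concentrate on a handful of answers.
    """
    if not preds:
        raise ValueError("preds must be non-empty")
    counts = Counter(round(p, ndigits) for p in preds)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(preds)
```

A regression-style baseline with continuous outputs would be expected to score lower under this measure than a model that keeps returning a few round numbers, though the value depends on the rounding chosen.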

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces ViewDiag, a controlled multi-view evaluation protocol using 176 object-pair tracks from Hypersim, ScanNet, and KITTI360 (80 scenes, 2-10 views per track). It evaluates leading VLMs on metric distance queries along three axes—metric accuracy, distributional concentration, and internal collapse (via latent feature probe)—and reports a consistent pattern of high cross-view prediction stability paired with substantial error. The central claim is that this indicates weak coupling between predictions and viewpoint-specific visual evidence, with stable outputs reflecting prior-driven collapse rather than evidence-sensitive reasoning; the results challenge the use of consistency as a proxy for geometric understanding.

Significance. If the reported pattern is robust, the work supplies a useful diagnostic benchmark and framework for spatial VLMs, highlighting limitations relevant to robotics and embodied AI. Strengths include the use of multiple public 3D datasets, a multi-view track design, and the addition of a latent probe for internal collapse; these elements support reproducibility and go beyond simple accuracy metrics.

major comments (1)
  1. [Abstract and ViewDiag protocol; description of metric accuracy] The interpretation that outputs deviating from 3D-reconstruction distances are 'incorrect' and demonstrate evidence insensitivity assumes the 3D metric distances are recoverable from the single 2D image inputs. Because of projective scale ambiguity in monocular views, the visual evidence alone does not determine unique metric distances; consistent answers across views could therefore reflect stable prior application rather than failure to couple to viewpoint-specific cues. This assumption is load-bearing for the central claim of weak evidence coupling.
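
The scale ambiguity invoked in this comment can be written out under a standard pinhole camera model. The symbols below (intrinsics K, rotation R, translation t, world points X, scale factor s) are introduced here for illustration and do not come from the paper.

```latex
\begin{align*}
  % Projection of a world point X into homogeneous image coordinates.
  \tilde{x} &\simeq K\,(R X + t) \\
  % Scale the scene and the camera translation by any s > 0:
  K\,\bigl(R\,(sX) + s\,t\bigr) &= s\,K\,(R X + t) \;\simeq\; K\,(R X + t) \\
  % The image is unchanged, but distances between points scale with s:
  \lVert sX_i - sX_j \rVert &= s\,\lVert X_i - X_j \rVert
\end{align*}
```

Geometry alone therefore leaves the metric scale free; in practice, cues such as familiar object sizes can narrow it, which is the kind of prior the comment refers to.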

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the role of monocular scale ambiguity in interpreting our metric accuracy results. This is a substantive point that bears on the strength of our central claim. We address it directly below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and ViewDiag protocol; description of metric accuracy] The interpretation that outputs deviating from 3D-reconstruction distances are 'incorrect' and demonstrate evidence insensitivity assumes the 3D metric distances are recoverable from the single 2D image inputs. Because of projective scale ambiguity in monocular views, the visual evidence alone does not determine unique metric distances; consistent answers across views could therefore reflect stable prior application rather than failure to couple to viewpoint-specific cues. This assumption is load-bearing for the central claim of weak evidence coupling.

    Authors: We agree that absolute metric distances are underdetermined from any single monocular image due to projective scale ambiguity. ViewDiag therefore does not assume unique recoverability from individual views. Instead, each object-pair track supplies multiple views that differ in perspective, apparent scale, and contextual cues. The key observation is that model outputs remain highly stable across these distinct inputs while deviating from the ground-truth 3D distances obtained from the scene reconstructions. This pattern is consistent with reliance on view-invariant priors rather than adaptation to viewpoint-specific evidence. We will revise the abstract and the ViewDiag protocol section to explicitly acknowledge monocular scale ambiguity and to frame the evidence-insensitivity claim in terms of cross-view stability rather than per-image recoverability.

    Revision: yes

Circularity Check

0 steps flagged

Empirical benchmarking with no circular derivation chain

full rationale

The paper is an empirical benchmarking study that introduces the ViewDiag protocol on external public datasets (Hypersim, ScanNet, KITTI360) and measures model outputs against independently defined axes of metric accuracy, distributional concentration, and internal collapse. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations appear in the provided text; the central claim follows directly from the observed patterns on these fixed external benchmarks rather than reducing to any input by construction. This is the normal case of a self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the 3D scene reconstructions provide accurate ground-truth distances that should be recoverable from individual 2D views. No free parameters or invented entities are introduced. The only axiom is a standard one about dataset fidelity.

axioms (1)
  • Domain assumption: The 3D reconstructions in Hypersim, ScanNet, and KITTI360 yield accurate metric ground truth for object-pair distances.
    Invoked when defining the evaluation targets for the ViewDiag protocol.
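
To show what this axiom asks of the data, the sketch below derives an object-pair distance from two sets of 3D points by measuring between their centroids. This is only one possible convention (closest-point or surface distances are equally plausible); the summary does not say which one ViewDiag uses, and the function names are hypothetical.

```python
import math

def centroid(points):
    """Mean position of a non-empty list of (x, y, z) points in metres."""
    if not points:
        raise ValueError("region has no 3D points")
    n = len(points)
    return tuple(sum(p[axis] for p in points) / n for axis in range(3))

def pair_distance(region_a, region_b):
    """Euclidean distance in metres between the centroids of two regions."""
    return math.dist(centroid(region_a), centroid(region_b))
```

Any error in the underlying depth or mesh carries straight through to this target, which is why the fidelity of the three datasets is treated as an assumption and not as a result.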

pith-pipeline@v0.9.1-grok · 5757 in / 1309 out tokens · 26302 ms · 2026-06-30T10:31:20.484532+00:00 · methodology


Reference graph

Works this paper leans on

29 extracted references · 6 canonical work pages · 4 internal anchors

  [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  [2] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
  [3] Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, and Yangyang Shi. DepthLM: Metric depth from vision language models. arXiv preprint arXiv:2509.25413,
  [4] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465,
  [5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  [6] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024.
  [7] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
  [8] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. Chapman and Hall/CRC, 1993.
  [9] Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, et al. Seeing across views: Benchmarking spatial reasoning of vision-language models in robotic scenes. arXiv preprint arXiv:2510.19400, 2025.
  [10] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
  [11] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3828–3838,
  [12] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494,
  [13] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
  [14] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.
  [15] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  [16] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
  [17] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
  [18] Yiyi Liao, Jun Xie, and Andreas Geiger. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2023.
  [19] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024.
  [20] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
  [21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  [22] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.
  [23] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10912–10922, 2021.
  [24] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
  [25] Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, and Tong Zhang. From indoor to open world: Revealing the spatial reasoning gap in MLLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16789–16799, 2026.
  [26] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.
  [27] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800, 2024.
  [28] Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, and Yi Ma. Seeing from another perspective: Evaluating multi-view understanding in MLLMs. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12000–12008, 2026.
  [29] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.