Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models

S Divakar Bhat; Toshihiko Yamasaki

arxiv: 2606.02742 · v2 · pith:MEQH25FGnew · submitted 2026-06-01 · 💻 cs.CV

Pith reviewed 2026-06-30 10:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords: vision-language models · spatial reasoning · metric distances · view consistency · evidence sensitivity · multi-view evaluation · ViewDiag

The pith

Vision-language models often give the same wrong answer to spatial distance questions from different viewpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether consistent answers from VLMs across different views indicate real geometric understanding. It finds instead that models frequently produce the same incorrect metric distance regardless of viewpoint, which suggests their outputs are only weakly tied to the visual details in each image. The evidence comes from ViewDiag, a new protocol that tracks object pairs across multiple views drawn from synthetic and captured 3D datasets (Hypersim, ScanNet, and KITTI360) and checks accuracy, answer concentration, and internal model collapse. The results place many models in a region of high consistency but low accuracy, suggesting reliance on learned priors over evidence from the current view. If true, cross-view agreement is not a reliable sign of spatial reasoning ability.

Core claim

Leading VLMs produce view-invariant and consistent answers on metric distance queries even when those answers are incorrect. This pattern of high prediction stability paired with substantial error indicates weak coupling between predictions and viewpoint-specific visual evidence, with stable outputs reflecting prior-driven collapse rather than evidence-sensitive reasoning.

What carries the argument

ViewDiag, a multi-view evaluation protocol that uses object-pair tracks across 2-10 views from Hypersim, ScanNet, and KITTI360 and measures metric accuracy, distributional concentration, and internal collapse via a latent feature probe.
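To make the "consistent yet wrong" regime concrete, here is a minimal sketch of how one track could be scored on accuracy and cross-view stability. Everything in it is an assumption made for illustration: the function names, the use of mean absolute relative error and the coefficient of variation, and both thresholds are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def track_metrics(preds, gts):
    """Summarise one object-pair track.

    preds: predicted distances in metres, one per view.
    gts:   ground-truth distances in metres, one per view.
    Returns (mean absolute relative error, coefficient of variation of preds).
    """
    if not preds or len(preds) != len(gts):
        raise ValueError("need one prediction and one ground truth per view")
    rel_err = mean(abs(p - g) / g for p, g in zip(preds, gts))
    centre = mean(preds)
    # Spread of the answers relative to their mean; 0.0 means identical answers.
    spread = pstdev(preds) / centre if centre > 0 else 0.0
    return rel_err, spread


def quadrant(rel_err, spread, err_thresh=0.25, spread_thresh=0.10):
    """Place a track in one of four regimes (thresholds are illustrative only)."""
    consistent = spread <= spread_thresh
    accurate = rel_err <= err_thresh
    if consistent and accurate:
        return "consistent and accurate"
    if consistent:
        return "consistent yet wrong"
    if accurate:
        return "varying and accurate"
    return "varying and wrong"


# A model that answers 2.0 m from every view of a pair that is 4.0 m apart.
err, cv = track_metrics([2.0, 2.0, 2.0], [4.0, 4.0, 4.0])
print(quadrant(err, cv))
```

In this toy case the relative error is 0.5 and the spread is 0, so the track lands in the "consistent yet wrong" regime that the paper's collapse quadrant is described as highlighting.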

If this is right

Cross-view consistency cannot serve as a proxy for geometric understanding in spatial VLMs.
Stable predictions across views likely stem from prior-driven collapse instead of visual evidence.
VLMs require evaluation on evidence coupling in addition to raw accuracy for spatial tasks.
ViewDiag offers a diagnostic tool to check if models are grounded in viewpoint-specific input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern holds, methods that enforce view-invariance during training may increase evidence insensitivity.
This finding suggests current VLMs may underperform in dynamic or embodied settings where viewpoint changes alter correct answers.
Similar evidence insensitivity could appear in other consistency-based evaluations, such as in video or 3D reconstruction tasks.

Load-bearing premise

The metric distances computed from 3D reconstructions match the visual information present in the 2D images fed to the models.

What would settle it

A model that changes its distance predictions appropriately when the viewpoint alters the true distance, while maintaining overall accuracy, would contradict the evidence insensitivity claim.
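One rough way to operationalise this test is to ask how much a model's answers move when the ground truth moves. The sketch below fits a least-squares slope of log prediction against log ground truth over a set of queries. The function name `log_slope` and the choice of a log-log slope as the sensitivity score are assumptions made here, not the paper's metric.

```python
import math


def log_slope(preds, gts):
    """Least-squares slope of log(pred) against log(gt).

    A slope near 1 means predictions scale with the true distance;
    a slope near 0 means predictions barely react to it.
    """
    if len(preds) != len(gts) or len(preds) < 2:
        raise ValueError("need at least two paired values")
    xs = [math.log(g) for g in gts]
    ys = [math.log(p) for p in preds]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("ground-truth distances must not all be equal")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


gts = [1.0, 2.0, 4.0]
print(log_slope([2.0, 2.0, 2.0], gts))  # same answer regardless of ground truth
print(log_slope([2.0, 4.0, 8.0], gts))  # off by a constant factor, but responsive
```

The second call illustrates a design point: a model with a constant scale bias still gets a slope of about 1, so this score separates responsiveness from absolute accuracy, and the two would need to be reported together.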

Figures

Figures reproduced from arXiv: 2606.02742 by S Divakar Bhat, Toshihiko Yamasaki.

**Figure 1.** View consistency can mislead: geometry baselines such as ZoeDepth [2] track visual evidence and achieve lower error, while typical and spatial VLMs are often consistent yet wrong. [Left] The collapse quadrant (Consistency vs. Accuracy) highlights models that are highly consistent but inaccurate. [Centre] Same-pair traces show predictions that remain fixed across views despite changing evidence. [Right] A b…

**Figure 2.** ViewDiag samples with annotated regions ([R0]/[R1]): …

**Figure 3.** Controlled multi-view protocol. A fixed region pair is …

**Figure 4.** Prediction histograms (log bins) for SpatialRGPT, DepthLM, and ZoeDepth. Geometry baselines exhibit broader output distri…

**Figure 5.** Residual vs GT (log bin count) for SpatialRGPT, DepthLM, and ZoeDepth, contrasting spatial VLMs and geometry-based …

**Figure 6.** Internal collapse scatters for Qwen2-VL-7B, SpatialRGPT, and DepthLM (…

**Figure 7.** Top-25 mode profiles for SpatialRGPT, DepthLM, and ZoeDepth, highlighting output concentration in spatial VLMs.

**Figure 8.** Targeted evidence traces: a VLM collapse case, a …

Original abstract

Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence. We introduce ViewDiag, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2-10 views per track. The protocol evaluates models along three axes: metric accuracy, distributional concentration, and internal collapse, the last of which is assessed using a latent feature probe. Across diverse models, we observe a consistent pattern of high prediction stability paired with substantial error, clustering in a regime characterized by strong consistency but low accuracy. These results challenge the common use of cross-view consistency as a proxy for geometric understanding. Instead, we show that stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning. ViewDiag provides a controlled benchmark and diagnostic framework for evaluating whether spatial VLMs are not only accurate, but also meaningfully coupled to visual evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The core claim that high cross-view consistency plus low metric accuracy shows weak evidence coupling is undercut by scale ambiguity in the 2D inputs versus 3D GT.

The two things to know are that the paper introduces ViewDiag, a track-based multi-view protocol with 176 object-pair tracks across Hypersim, ScanNet, and KITTI360, and that it reports leading VLMs produce stable but inaccurate answers on metric distance queries.

ViewDiag adds controlled tracks with 2-10 views per pair and three axes: metric accuracy against 3D-derived distances, distributional concentration, and a latent probe for internal collapse. This is a clear step past single-image accuracy tests and the pattern of high stability with substantial error appears consistent across models. The protocol itself is a useful diagnostic addition for spatial VLM evaluation.
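The distributional-concentration axis can be pictured as asking how much of a model's output mass sits on a handful of favourite answers. The sketch below is one hypothetical way to compute that. The 0.1 m bin width, the function name `top_k_mass`, and the choice of `k` are assumptions made here; the page only indicates that the paper reports top-25 mode profiles.

```python
from collections import Counter


def top_k_mass(preds, k=3, resolution=0.1):
    """Fraction of predictions that fall in the k most frequent bins.

    preds:      predicted distances in metres.
    resolution: bin width in metres used to group near-identical answers.
    """
    if not preds:
        raise ValueError("preds must not be empty")
    # Group predictions by the nearest multiple of `resolution`.
    bins = Counter(round(p / resolution) for p in preds)
    top = sum(count for _, count in bins.most_common(k))
    return top / len(preds)


collapsed = [1.0] * 8 + [2.0, 3.0]   # most answers are the same number
spread_out = [0.5, 1.5, 2.5, 3.5]    # every answer is different
print(top_k_mass(collapsed, k=1))
print(top_k_mass(spread_out, k=1))
```

A value close to 1 for small `k` means the model keeps returning the same few numbers, which is the kind of output concentration the protocol is described as measuring.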

The main soft spot is the interpretation. Labeling outputs as incorrect and therefore evidence-insensitive relies on metric ground truth from 3D reconstructions. Each model sees only a single 2D image, where projective geometry leaves absolute scale underdetermined. A model applying stable priors on object sizes or scene scale can produce view-invariant answers that deviate from the 3D numbers without ignoring the available visual evidence. The paper would need to show either that the models have sufficient information to recover the metric or that the deviation persists after controlling for scale to support the weak-coupling conclusion.
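The scale ambiguity behind this objection can be written out in a few lines. The derivation assumes a standard pinhole camera with intrinsic matrix $K$ and points expressed in the camera frame; the camera model is an assumption here, since the paper's own is not visible on this page. The symbol $\simeq$ denotes equality up to a non-zero scale factor in homogeneous coordinates.

```latex
\[
\mathbf{x} \simeq K\,\mathbf{X},
\qquad
K\,(s\,\mathbf{X}) = s\,K\,\mathbf{X} \simeq K\,\mathbf{X}
\quad (s > 0),
\]
\[
d_s = \lVert s\,\mathbf{X}_1 - s\,\mathbf{X}_2 \rVert
    = s\,\lVert \mathbf{X}_1 - \mathbf{X}_2 \rVert
    = s\,d .
\]
```

Scaling the whole scene by $s$ leaves every pixel unchanged while multiplying the distance between the two objects by $s$. A single image therefore fixes $d$ only up to the unknown factor $s$, and any metric answer has to come from cues outside pure projective geometry, such as known object sizes.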

The work is empirical benchmarking on public datasets with no obvious circularity or free parameters fitted to the result. Methods details are not fully visible in the abstract, but the design looks reproducible on the surface.

This is for researchers building or auditing spatial reasoning in VLMs for robotics and embodied AI. Readers who want new multi-view diagnostics will find the protocol worth examining even if the main claim needs tightening. It deserves a serious referee to check implementation and the scale issue.

Recommendation: send to peer review.

Referee Report

1 major / 0 minor

Summary. The paper introduces ViewDiag, a controlled multi-view evaluation protocol using 176 object-pair tracks from Hypersim, ScanNet, and KITTI360 (80 scenes, 2-10 views per track). It evaluates leading VLMs on metric distance queries along three axes (metric accuracy, distributional concentration, and internal collapse via a latent feature probe) and reports a consistent pattern of high cross-view prediction stability paired with substantial error. The central claim is that this indicates weak coupling between predictions and viewpoint-specific visual evidence, with stable outputs reflecting prior-driven collapse rather than evidence-sensitive reasoning; the results challenge the use of consistency as a proxy for geometric understanding.

Significance. If the reported pattern is robust, the work supplies a useful diagnostic benchmark and framework for spatial VLMs, highlighting limitations relevant to robotics and embodied AI. Strengths include the use of multiple public 3D datasets, a multi-view track design, and the addition of a latent probe for internal collapse; these elements support reproducibility and go beyond simple accuracy metrics.

major comments (1)

[Abstract and ViewDiag protocol] (description of metric accuracy): The interpretation that outputs deviating from 3D-reconstruction distances are 'incorrect' and demonstrate evidence insensitivity assumes the 3D metric distances are recoverable from the single 2D image inputs. Because of projective scale ambiguity in monocular views, the visual evidence alone does not determine unique metric distances; consistent answers across views could therefore reflect stable prior application rather than failure to couple to viewpoint-specific cues. This assumption is load-bearing for the central claim of weak evidence coupling.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the role of monocular scale ambiguity in interpreting our metric accuracy results. This is a substantive point that bears on the strength of our central claim. We address it directly below and will revise the manuscript accordingly.

Referee: [Abstract and ViewDiag protocol] (description of metric accuracy): The interpretation that outputs deviating from 3D-reconstruction distances are 'incorrect' and demonstrate evidence insensitivity assumes the 3D metric distances are recoverable from the single 2D image inputs. Because of projective scale ambiguity in monocular views, the visual evidence alone does not determine unique metric distances; consistent answers across views could therefore reflect stable prior application rather than failure to couple to viewpoint-specific cues. This assumption is load-bearing for the central claim of weak evidence coupling.

Authors: We agree that absolute metric distances are underdetermined from any single monocular image due to projective scale ambiguity. ViewDiag therefore does not assume unique recoverability from individual views. Instead, each object-pair track supplies multiple views that differ in perspective, apparent scale, and contextual cues. The key observation is that model outputs remain highly stable across these distinct inputs while deviating from the ground-truth 3D distances obtained from the scene reconstructions. This pattern is consistent with reliance on view-invariant priors rather than adaptation to viewpoint-specific evidence. We will revise the abstract and the ViewDiag protocol section to explicitly acknowledge monocular scale ambiguity and to frame the evidence-insensitivity claim in terms of cross-view stability rather than per-image recoverability.

Revision: yes

Circularity Check

0 steps flagged

Empirical benchmarking with no circular derivation chain

The paper is an empirical benchmarking study that introduces the ViewDiag protocol on external public datasets (Hypersim, ScanNet, KITTI360) and measures model outputs against independently defined axes of metric accuracy, distributional concentration, and internal collapse. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations appear in the provided text; the central claim follows directly from the observed patterns on these fixed external benchmarks rather than reducing to any input by construction. This is the normal case of a self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the 3D scene reconstructions provide accurate ground-truth distances that should be recoverable from individual 2D views. No free parameters or invented entities are introduced. The only axioms are standard ones about dataset fidelity.

axioms (1)

Domain assumption: The 3D reconstructions in Hypersim, ScanNet, and KITTI360 yield accurate metric ground truth for object-pair distances. Invoked when defining the evaluation targets for the ViewDiag protocol.

Reference graph

Works this paper leans on

29 extracted references · 6 canonical work pages · 4 internal anchors

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

[2] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.

[3] Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, and Yangyang Shi. DepthLM: Metric depth from vision language models. arXiv preprint arXiv:2509.25413.

[4] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465.

[5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.

[6] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024.

[7] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

[8] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. Chapman and Hall/CRC, 1993.

[9] Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, et al. Seeing across views: Benchmarking spatial reasoning of vision-language models in robotic scenes. arXiv preprint arXiv:2510.19400, 2025.

[10] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.

[11] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3828–3838.

[12] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494.

[13] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.

[14] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.

[15] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.

[16] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

[17] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.

[18] Yiyi Liao, Jun Xie, and Andreas Geiger. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2023.

[19] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024.

[20] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

[21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[22] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.

[23] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10912–10922, 2021.

[24] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[25] Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, and Tong Zhang. From indoor to open world: Revealing the spatial reasoning gap in MLLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16789–16799, 2026.

[26] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.

[27] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800, 2024.

[28] Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, and Yi Ma. Seeing from another perspective: Evaluating multi-view understanding in MLLMs. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12000–12008, 2026.

[29] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
