arxiv: 2605.30161 · v1 · pith:JSRXVPDUnew · submitted 2026-05-28 · 💻 cs.CV

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Pith reviewed 2026-06-29 08:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · spatial reasoning · representation analysis · perspective bias · embedding entanglement · synthetic benchmark · contrastive evaluation

The pith

Vision-language models consistently entangle vertical image position with distance in their embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether strong benchmark scores in VLMs reflect genuine 3D spatial understanding or dependence on statistical patterns from natural photographs. Analysis of embedding spaces across model families shows a persistent vertical-distance entanglement that mirrors the perspective bias typical of camera images. This mixing produces clear accuracy drops on examples that violate the usual photo heuristics, and the entanglement grows stronger even as overall accuracy rises with more training data. A new synthetic benchmark called SpatialTunnel removes natural-image correlations to confirm the bias is intrinsic to the models rather than an artifact of test sets. Models whose spatial axes remain better separated show improved robustness on varied spatial reasoning tasks.

Core claim

VLMs organize spatial information such that vertical position in the image plane becomes entangled with inferred distance, reproducing the perspective statistics of training photographs; this produces measurable accuracy gaps on counter-heuristic cases, scales with data volume, and is isolated from evaluation skew by the SpatialTunnel benchmark.

What carries the argument

Minimal contrastive pairs that probe organization and disentanglement of spatial axes inside VLM embeddings, together with the SpatialTunnel synthetic benchmark that removes natural-image correlations.
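One way to picture the contrastive-pair probe is the sketch below. It is an illustration under stated assumptions, not the paper's implementation: the toy 3-dimensional embeddings, the helper names (`delta`, `axis_direction`, `cosine`), and the choice of cosine similarity between mean delta vectors as an entanglement score are all hypothetical. Each pair differs along a single spatial axis, the embedding difference gives a delta vector, and a high similarity between the vertical and distance directions would indicate entanglement.

```python
import math

def delta(emb_a, emb_b):
    """Difference between the embeddings of the two members of a contrastive pair."""
    return [a - b for a, b in zip(emb_a, emb_b)]

def axis_direction(pairs):
    """Mean delta vector over all pairs that vary a single spatial axis."""
    deltas = [delta(a, b) for a, b in pairs]
    return [sum(col) / len(deltas) for col in zip(*deltas)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (made up for illustration, not taken from any model).
above_below_pairs = [([1.0, 2.0, 0.5], [1.0, 0.0, 0.1]),
                     ([0.0, 1.5, 0.9], [0.0, -0.5, 0.5])]
far_near_pairs = [([0.2, 1.0, 3.0], [0.2, 0.0, 1.0]),
                  ([1.0, 0.5, 2.5], [1.0, -0.5, 0.5])]
left_right_pairs = [([2.0, 0.3, 0.3], [0.0, 0.3, 0.3]),
                    ([1.0, 0.7, 0.2], [-1.0, 0.7, 0.2])]

vertical_dir = axis_direction(above_below_pairs)
distance_dir = axis_direction(far_near_pairs)
horizontal_dir = axis_direction(left_right_pairs)

# A high value here would mean the "above" and "far" directions overlap.
vertical_vs_distance = cosine(vertical_dir, distance_dir)
horizontal_vs_distance = cosine(horizontal_dir, distance_dir)
print(vertical_vs_distance, horizontal_vs_distance)
```

In this toy data the vertical and distance directions overlap substantially while the horizontal direction stays orthogonal to distance, which is the qualitative pattern the page describes.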

If this is right

  • Accuracy differs systematically between perspective-consistent and counter-heuristic test items.
  • The vertical-distance entanglement strengthens under continued data scaling.
  • Models with similar benchmark scores can still differ in internal spatial structure, and those differences forecast performance on other spatial tasks.
  • Greater separation of spatial axes inside the model predicts better robustness across benchmarks.
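The first consequence, a systematic accuracy difference between the two item types, can be measured with a few lines. This is a minimal sketch on made-up records; the field names `group` and `correct` are hypothetical and do not reflect the benchmark's actual file format.

```python
def accuracy(records, group):
    """Fraction of correct answers among items of the given group."""
    items = [r for r in records if r["group"] == group]
    return sum(r["correct"] for r in items) / len(items)

# Toy evaluation log (fabricated for illustration only).
records = [
    {"group": "consistent", "correct": True},
    {"group": "consistent", "correct": True},
    {"group": "consistent", "correct": True},
    {"group": "consistent", "correct": False},
    {"group": "counter", "correct": True},
    {"group": "counter", "correct": False},
    {"group": "counter", "correct": False},
    {"group": "counter", "correct": False},
]

consistent_acc = accuracy(records, "consistent")
counter_acc = accuracy(records, "counter")
gap = consistent_acc - counter_acc
print(consistent_acc, counter_acc, gap)
```

A gap near zero would suggest the model does not lean on the "higher means farther" heuristic; a large positive gap is the signature the page attributes to the shortcut.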

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training on photographic data may embed perspective statistics so deeply that they resist removal by scale alone.
  • The contrastive-pair method could be reused to diagnose other unintended entanglements, such as between lighting and material properties.
  • Architectures that explicitly encourage axis separation during pre-training might reduce reliance on such shortcuts.
  • Synthetic benchmarks like SpatialTunnel offer a route to measure whether future models have moved beyond image-plane heuristics.

Load-bearing premise

The constructed contrastive pairs and SpatialTunnel benchmark succeed in isolating only the targeted spatial axes without introducing new unintended correlations.

What would settle it

Finding a model family that exhibits no measurable vertical-distance correlation on the contrastive pairs yet still shows the same accuracy gap on counter-perspective examples, or a model where forcing axis separation produces no robustness gain on SpatialTunnel.

Figures

Figures reproduced from arXiv: 2605.30161 by Chan Hee Song, Cheolhong Min, Daeun Lee, Hyeonseong Jeon, Jaesik Park, Jaeyun Jung, Jonathan Tremblay, Yu Su.

Figure 1: Many VLMs answer spatial questions via a perspective-driven shortcut, e.g., objects located higher in the image are further away in 3D. By confusing 2D vertical position with 3D distance, models fail systematically on counter examples. Our SpatialTunnel benchmark and contrastive probing expose this vertical-distance entanglement. In contrast, strong spatial VLMs show disentangled axes and consistent corr…

Figure 2: Consistent vs. counter examples. Consistent: Farther object appears higher in the image; Counter: Farther object appears lower. Perspective projection and vertical position. From the observer's viewpoint, objects farther away on a common ground surface appear higher in the image. This phenomenon gives rise to the classical elevation cue: for objects lying on the ground plane, those nearer to the horizon …

Figure 3: SpatialTunnel holds the two objects at fixed depths while sweeping their angular positions around the tunnel cross-section, so that 2D image-plane layout varies independently of depth ordering. … objects near the top and bottom of the image can be equidistant from the camera, the common heuristic "higher in the image ⇒ farther" no longer holds. We parameterize each object by its depth z and an angular positio…

Figure 4: Mean accuracy heatmaps on SpatialTunnel for Molmo-7B. Each cell indexes a joint angular configuration (θ1, θ2) of the two objects (red = higher accuracy; blue = lower). Gray indicates configurations outside the subset. From base → 400k → 2M training samples, accuracy on (a) perspective-consistent cells improves steadily. In contrast, (b) counter cells remain substantially harder, with the largest drop at 4…

Figure 5: Contrastive probing for representation-level spatial analysis.

Figure 6: Internal probing analysis of spatial representations.

Figure 7: PCA of delta vectors across models. Each point is a delta vector colored by axis (orange: horizontal, green: vertical, purple: distance), with darker/lighter shades distinguishing opposing categories within each axis (e.g., left vs. right). Molmo (2M), NVILA (2M), and Qwen (2M) show separation along the horizontal and vertical axes, but distance delta vectors remain poorly distinguished. RoboRefer and Qwen…

Figure 8: Object-size variation in SpatialTunnel. A representative scene rendered under six (s1, s2) configurations with s1 + s2 = 0.4, where s1 and s2 denote the sizes of obj1 and obj2, respectively. obj1 is always farther from the camera than obj2. As s1 increases from left to right, the farther object grows while the nearer object shrinks, moving from a size-consistent to a size-conflicting configuration. …

Figure 9: Correctness as a function of object size.

Figure 10: Distance Coherence measured on synthetic (…

Figure 11: Cross-category similarity heatmaps for the Molmo family.

Figure 12: Cross-category similarity heatmaps for the NVILA family.

Figure 13: Cross-category similarity heatmaps for the Qwen family.

Figure 14: 2D PCA of delta vectors for the Molmo family.

Figure 15: 2D PCA of delta vectors for the NVILA family.

Figure 16: 2D PCA of delta vectors for the Qwen family.

Figure 17: 3D PCA of delta vectors for the Molmo family.

Figure 18: 3D PCA of delta vectors for the NVILA family.

Figure 19: 3D PCA of delta vectors for the Qwen family.
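The geometry behind the Figure 2 and Figure 3 captions can be sketched with a simple pinhole camera. This is a hedged illustration, not the benchmark's rendering code: the focal length, camera height, tunnel radius, axis conventions (camera looking along +z, image y pointing up), and every function name are assumptions introduced here. The first part shows the elevation cue on a ground plane; the second shows how placing objects on a tunnel cross-section lets vertical image position vary while depth ordering stays fixed.

```python
import math

def project(x, y, z, focal=1.0):
    """Pinhole projection of a camera-frame point; image y points up."""
    return (focal * x / z, focal * y / z)

def ground_point_image_y(depth, camera_height=1.5, focal=1.0):
    """Image-plane height of a ground point straight ahead at the given depth."""
    return project(0.0, -camera_height, depth, focal)[1]

def tunnel_image_y(depth, theta_deg, radius=1.0, focal=1.0):
    """Image-plane height of an object on the tunnel wall at angle theta."""
    theta = math.radians(theta_deg)
    x, y = radius * math.cos(theta), radius * math.sin(theta)
    return project(x, y, depth, focal)[1]

# Elevation cue: on a ground plane, the farther point projects higher,
# approaching the horizon at y = 0 from below.
near_ground_y = ground_point_image_y(2.0)
far_ground_y = ground_point_image_y(8.0)

# Tunnel: the far object stays at depth 6 and the near object at depth 3,
# only the angles change.
far_higher_when_far_on_top = tunnel_image_y(6.0, 90) > tunnel_image_y(3.0, -90)
far_higher_when_far_on_bottom = tunnel_image_y(6.0, -90) > tunnel_image_y(3.0, 90)

print(near_ground_y, far_ground_y)
print(far_higher_when_far_on_top, far_higher_when_far_on_bottom)
```

With depths held fixed, swapping the two angles flips which object appears higher, so a model that answers "higher means farther" is right in one configuration and wrong in the other.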
Original abstract

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a representation-level analysis framework that uses minimal contrastive pairs to probe how VLMs organize spatial axes in their embeddings. Across model families it reports a consistent vertical-distance entanglement that mirrors the perspective bias of natural photographs. This entanglement produces an accuracy gap between perspective-consistent and counter-heuristic examples, intensifies under data scaling, and is isolated from evaluation-set skew via the new synthetic SpatialTunnel benchmark. The work further claims that models with better-separated spatial axes are more robust on diverse spatial-reasoning benchmarks. Code and benchmark are released.

Significance. If the isolation claims hold, the results would demonstrate that apparent spatial competence in VLMs often reflects statistical shortcuts rather than structured 3D understanding, and that internal representational geometry predicts downstream robustness. The release of code and the SpatialTunnel benchmark is a clear strength for reproducibility and follow-up work.

major comments (2)
  1. [Framework and benchmark construction] (abstract and § on SpatialTunnel): The central claim that the entanglement is model-intrinsic rests on the assertion that minimal contrastive pairs isolate only vertical position and distance while SpatialTunnel removes all common correlations. No quantitative verification (e.g., statistical tests on low-level image statistics such as scale, lighting gradients, or edge distributions before/after position changes) is described; without such checks the measured entanglement could arise from residual synthesis artifacts rather than internal bias.
  2. [Scaling and robustness experiments] (section reporting scaling experiments): The claim that the bias intensifies under data scaling while overall accuracy improves is load-bearing for the argument that shortcut reliance is not alleviated by scale. The reported accuracy gaps and correlation with representation quality should be accompanied by explicit controls for model size, training data volume, and benchmark difficulty to ensure the entanglement metric is not confounded by these factors.
minor comments (2)
  1. Notation for embedding axes and distance metrics should be defined once in a dedicated subsection rather than introduced piecemeal across figures.
  2. Figure captions for the contrastive-pair examples should explicitly state the exact pixel or 3D coordinate changes applied while holding the orthogonal axis fixed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments, which help clarify the presentation of our results. Below we address the major comments point by point.

  1. Referee: [Framework and benchmark construction] (abstract and § on SpatialTunnel): The central claim that the entanglement is model-intrinsic rests on the assertion that minimal contrastive pairs isolate only vertical position and distance while SpatialTunnel removes all common correlations. No quantitative verification (e.g., statistical tests on low-level image statistics such as scale, lighting gradients, or edge distributions before/after position changes) is described; without such checks the measured entanglement could arise from residual synthesis artifacts rather than internal bias.

    Authors: We agree that providing quantitative verification of the isolation in SpatialTunnel would strengthen the manuscript. In the revised version, we will add an analysis section reporting statistical tests (such as t-tests or distribution comparisons) on low-level image statistics including scale, lighting gradients, and edge distributions for the minimal contrastive pairs and benchmark variants. This will confirm that position changes do not introduce unintended correlations. revision: yes

  2. Referee: [Scaling and robustness experiments] (section reporting scaling experiments): The claim that the bias intensifies under data scaling while overall accuracy improves is load-bearing for the argument that shortcut reliance is not alleviated by scale. The reported accuracy gaps and correlation with representation quality should be accompanied by explicit controls for model size, training data volume, and benchmark difficulty to ensure the entanglement metric is not confounded by these factors.

    Authors: We appreciate this point. Our scaling experiments compare models of different sizes within and across families, but we acknowledge the need for more explicit controls. In revision, we will add controls by reporting the entanglement metric alongside model parameter counts and include a discussion of how benchmark difficulty is held constant by using identical test sets. For training data volume, as this information is not always available for proprietary models, we will note this limitation and focus on available controls for model size. revision: partial
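The low-level check discussed in the first exchange above could take roughly the following shape. This is a sketch under assumptions, not the authors' planned analysis: mean absolute horizontal gradient is used as a stand-in for an "edge distribution" statistic, images are plain nested lists of grayscale values, the per-image numbers are fabricated, and the function names are hypothetical. It computes a paired t-statistic over the same statistic measured before and after a position change.

```python
import math
import statistics

def mean_abs_horizontal_gradient(image):
    """Crude edge-strength statistic for a grayscale image given as rows of numbers."""
    diffs = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    return sum(diffs) / len(diffs)

def paired_t_statistic(before, after):
    """t-statistic for paired samples: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(after, before)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

# Statistic on a tiny toy image.
toy_image = [[0, 1, 3],
             [2, 2, 2]]
toy_edge_strength = mean_abs_horizontal_gradient(toy_image)

# Fabricated per-pair statistics for four contrastive pairs.
edge_before = [0.20, 0.25, 0.30, 0.22]
edge_after = [0.21, 0.24, 0.31, 0.22]
t_value = paired_t_statistic(edge_before, edge_after)
print(toy_edge_strength, t_value)
```

A t-statistic close to zero across many pairs would be consistent with the position change leaving this statistic untouched. Turning it into a p-value requires the t distribution with n - 1 degrees of freedom, which this sketch deliberately leaves out.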

Circularity Check

0 steps flagged

No circularity: empirical measurements on external models and new benchmark

The paper reports direct empirical observations of spatial representations in existing VLMs via minimal contrastive pairs and the newly introduced SpatialTunnel benchmark. No equations, derivations, fitted parameters presented as predictions, or self-citation chains are used to establish the central claims. The analysis measures entanglement in model embeddings and benchmark performance gaps without reducing any result to a tautological input from the same data or prior self-work. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical and introduces no new theoretical entities or fitted constants; it relies on standard machine-learning assumptions about what embeddings encode.

axioms (1)
  • domain assumption: Contrastive image pairs can isolate individual spatial axes in VLM embeddings.
    Invoked when constructing the representation-level analysis framework.

pith-pipeline@v0.9.1-grok · 5780 in / 1133 out tokens · 32346 ms · 2026-06-29T08:04:25.090423+00:00 · methodology


Reference graph

Works this paper leans on

65 extracted references · 22 canonical work pages · 14 internal anchors

  1. [1]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022) 1

  2. [2]

    Anthropic: Claude opus 4 & claude sonnet 4 system card. Tech. rep., Anthropic (May 2025),https : / / www - cdn . anthropic . com / 4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf1

  3. [3]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025) 2, 5, 7, 11, 22

  4. [4]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 2, 5, 7, 11, 21

  5. [5]

    Blender Foundation (2025),https://www.blender.org/7

    Blender Online Community: Blender - a 3D modelling and rendering package. Blender Foundation (2025),https://www.blender.org/7

  6. [6]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Brazil, G., Kumar, A., Straub, J., Ravi, N., Johnson, J., Gkioxari, G.: Omni3d: A large benchmark and model for 3d object detection in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13154– 13164 (2023) 24

  7. [7]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3d: Learning from rgb-d data in indoor envi- ronments. arXiv preprint arXiv:1709.06158 (2017) 24

  8. [8]

    ShapeNet: An Information-Rich 3D Model Repository

    Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015) 23

  9. [9]

    In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition

    Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition. pp. 14455–14465 (2024) 5

  10. [10]

    arXiv e-prints pp

    Chen, H., Lin, J., Chen, X., Fan, Y., Jin, X., Su, H., Dong, J., Fu, J., Shen, X.: Rethinking visual layer selection in multimodal llms. arXiv e-prints pp. arXiv–2504 (2025) 12, 32

  11. [11]

    In: Forty-second International Conference on Machine Learning (2025),https://openreview.net/forum?id=k7vcuqLK4X3, 4

    Chen, S., Zhu, T., Zhou, R., Zhang, J., Gao, S., Niebles, J.C., Geva, M., He, J., Wu, J., Li, M.: Why is spatial reasoning hard for VLMs? an attention mechanism perspective on focus areas. In: Forty-second International Conference on Machine Learning (2025),https://openreview.net/forum?id=k7vcuqLK4X3, 4

  12. [12]

    SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

    Chen, S., Uy, M.A., Song, C.H., Ladhak, F., Murali, A., Qu, Q., Birchfield, S., Blukis, V., Tremblay, J.: SpaceTools: Tool-augmented spatial reasoning via double interactive RL. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026),https://arxiv.org/abs/2512.04069, to appear 2

  13. [13]

    In: Advances in Neural Information Processing Systems (NeurIPS) (2024) 2

    Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2024) 2

  14. [14]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Dai,A.,Chang,A.X.,Savva,M.,Halber,M.,Funkhouser,T.,Nießner,M.:Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5828–5839 (2017) 22, 24 16 C. Min et al

  15. [15]

    In: Proceedings of the Com- puter Vision and Pattern Recognition Conference

    Danier, D., Aygün, M., Li, C., Bilen, H., Mac Aodha, O.: Depthcues: Evaluating monocular depth perception in large vision models. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 20049–20059 (2025) 4, 5, 21

  16. [16]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J.S., Salehi, M., Muen- nighoff, N., Lo, K., Soldaini, L., et al.: Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 91–104 (2025) 2, 5, 7, 11, 21

  17. [17]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13142–13153 (2023) 23

  18. [18]

    Advances in Neural Information Processing Systems 35, 5982–5994 (2022) 22

    Deitke, M., VanderBilt, E., Herrasti, A., Weihs, L., Ehsani, K., Salvador, J., Han, W., Kolve, E., Kembhavi, A., Mottaghi, R.: Procthor: Large-scale embodied ai using procedural generation. Advances in Neural Information Processing Systems 35, 5982–5994 (2022) 22

  19. [19]

    arXiv preprint arXiv:2505.13441 (2025) 5, 6, 23, 24

    Deshpande, A., Deng, Y., Ray, A., Salvador, J., Han, W., Duan, J., Zeng, K.H., Zhu, Y., Krishna, R., Hendrix, R.: Graspmolmo: Generalizable task-oriented grasp- ing via large-scale synthetic data generation. arXiv preprint arXiv:2505.13441 (2025) 5, 6, 23, 24

  20. [20]

    In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

    Du, M., Wu, B., Li, Z., Huang, X.J., Wei, Z.: Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 346–355 (2024) 2, 3, 4, 6, 12, 24

  21. [21]

    In: 2021 IEEE International Conference on Robotics and Automation (ICRA)

    Eppner, C., Mousavian, A., Fox, D.: Acronym: A large-scale grasp dataset based on simulation. In: 2021 IEEE International Conference on Robotics and Automation (ICRA). pp. 6222–6227. IEEE (2021) 23

  22. [22]

    Journal of Open Source Software10(105), 7561 (2025) 23

    Eppner, C., Murali, A., Garrett, C., O’Flaherty, R., Hermans, T., Yang, W., Fox, D.: scene_synthesizer: A python library for procedural scene generation in robot manipulation. Journal of Open Source Software10(105), 7561 (2025) 23

  23. [23]

    In: European Conference on Computer Vision

    Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not per- ceive. In: European Conference on Computer Vision. pp. 148–166. Springer (2024) 2, 3, 10, 25

  24. [24]

    Google: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities (2025),https://arxiv.org/ abs/2507.062611, 27

  25. [25]

    In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y

    Gurnee, W., Tegmark, M.: Language models represent space and time. In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y. (eds.) International Conference on Learning Representations. vol. 2024, pp. 2483– 2503 (2024),https://proceedings.iclr.cc/paper_files/paper/2024/file/ 0a6059857ae5c82ea9726ee9282a7145-Paper-Conference.pdf12, 32

  26. [26]

    url: https://web

    Hata, K., Savarese, S.: Cs231a course notes 1: Camera models. url: https://web. stanford. edu/class/cs231a/course_ notes/01-camera-models. pdf (2015) 20

  27. [27]

    Springer Nature (2022) 21

    Hoiem, D., Savarese, S.: Representations and techniques for 3D object recognition and scene interpretation. Springer Nature (2022) 21

  28. [28]

    In: Proceedings of the 2023 Conference on Empirical Meth- ods in Natural Language Processing

    Hu, J., Levy, R.: Prompting is not a substitute for probability measurements in large language models. In: Proceedings of the 2023 Conference on Empirical Meth- ods in Natural Language Processing. pp. 5040–5060 (2023) 8 Why Far Looks Up 17

  29. [29]

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M.Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Ren, A.Z., Shi, L.X., Smith, L., Springenberg, J.T., Stachow...

  30. [30]

    In: Proceedings of the Con- ference on Empirical Methods in Natural Language Processing (EMNLP) (2023) 3

    Kamath, A., Hessel, J., Chang, K.W.: What’s “up” with vision-language models? investigating their struggle with spatial reasoning. In: Proceedings of the Con- ference on Empirical Methods in Natural Language Processing (EMNLP) (2023) 3

  31. [31]

    arXiv preprint arXiv:2601.12626 (2026) 4

    Kang, R., Chen, H., Gkioxari, G., Perona, P.: Linear mechanisms for spatiotem- poral reasoning in vision language models. arXiv preprint arXiv:2601.12626 (2026) 4

  32. [32]

    In: Agrawal, P., Kroemer, O., Burgard, W

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., Finn, C.: Openvla: An open-source vision-language-action model. In: Agrawal, P., Kroemer, O., Burgard, W. (eds.) Proceedings of the 8th Conference on R...

  33. [33]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Deitke, M.,Ehsani,K.,Gordon,D.,Zhu,Y.,etal.:Ai2-thor:Aninteractive3denvironment for visual ai. arXiv preprint arXiv:1712.05474 (2017) 24

  34. [34]

    International journal of computer vision123(1), 32–73 (2017) 24

    Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision123(1), 32–73 (2017) 24

  35. [35]

    International journal of computer vision128(7), 1956–1981 (2020) 23

    Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Ka- mali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision128(7), 1956–1981 (2020) 23

  36. [36]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Lazarow, J., Griffiths, D., Kohavi, G., Crespo, F., Dehghan, A.: Cubify anything: Scaling indoor 3d object detection. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22225–22233 (2025) 23

  37. [37]

    arXiv preprint arXiv:2510.12276 (2025) 4

    Li, F., Song, W., Zhao, H., Wang, J., Ding, P., Wang, D., Zeng, L., Li, H.: Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276 (2025) 4

  38. [38]

    In: European conference on computer vision

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014) 24

  39. [39]

    Improved Baselines with Visual Instruction Tuning

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023) 2

  40. [40]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., Xi, H., Cao, S., Gu, Y., Li, D., et al.: Nvila: Efficient frontier visual language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4122– 4134 (2025) 5, 7, 11, 21

  41. [41]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, :, Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L.J., Fang, Y., Fox, D., Hu, F., Huang, S., Jang, J., Jiang, Z., Kautz, J., Kundalia, K., Lao, L., Li, Z., Lin, Z., Lin, K., Liu, G., Llontop, E., Magne, L., Mandlekar, A., 18 C. Min et al. Narayan, A., Nasiriany, S., Reed, S., Tan, Y.L., Wang, G., Wang, Z., Wang, J., Wang, Q., X...

  42. [42]

    OpenAI: Openai gpt-5 system card (2025),https://arxiv.org/abs/2601.03267 1, 27

  43. [43]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 32

  44. [44]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Raistrick, A., Mei, L., Kayan, K., Yan, D., Zuo, Y., Han, B., Wen, H., Parakh, M., Alexandropoulos, S., Lipson, L., et al.: Infinigen indoors: Photorealistic indoor scenes using procedural generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21783–21794 (2024) 23

  45. [45]

    In: Second Conference on Language Modeling (2025),https://openreview.net/forum?id=DW8U8ZWa1U 4, 5, 6, 22, 24

    Ray, A., Duan, J., II, E.L.B., Tan, R., Bashkirova, D., Hendrix, R., Ehsani, K., Kembhavi,A.,Plummer,B.A.,Krishna,R.,Zeng,K.H.,Saenko,K.:SAT:Dynamic spatial aptitude training for multimodal language models. In: Second Conference on Language Modeling (2025),https://openreview.net/forum?id=DW8U8ZWa1U 4, 5, 6, 22, 24

  46. [46]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops

    Savva, M., Chang, A.X., Hanrahan, P.: Semantically-enriched 3d models for common-sense knowledge. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 24–31 (2015) 23

  47. [47]

    In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

    Sheta, H., Huang, E.H., Wu, S., Alenabi, I., Hong, J., Lin, R., Ning, R., Wei, D., Yang, J., Zhou, J., et al.: From behavioral performance to internal compe- tence: Interpreting vision-language models with vlm-lens. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 886–895 (2025) 4

  48. [48]

    Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., Garg, A.: Progprompt: Generating situated robot task plans using large language models. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 11523–11530 (2023). https://doi.org/10.1109/ICRA48891.2023.10161317 1

  49. [49]

    Skean, O., Arefin, M.R., Zhao, D., Patel, N.N., Naghiyev, J., LeCun, Y., Shwartz-Ziv, R.: Layer by layer: Uncovering hidden representations in language models. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=WGXb7UdvTX 12, 32

  50. [50]

    Song, C.H., Blukis, V., Tremblay, J., Tyree, S., Su, Y., Birchfield, S.: Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15768–15780 (2025) 2, 4, 5, 6, 22, 24

  51. [51]

    Song, C.H., Wu, J., Washington, C., Sadler, B.M., Chao, W.L., Su, Y.: Llm-planner: Few-shot grounded planning for embodied agents with large language models. In: ICCV (2023) 1

  52. [52]

    Szeliski, R.: Computer vision: algorithms and applications. Springer Nature (2022) 20

  53. [53]

    arXiv preprint arXiv:2601.14352 (2026) 2

    Tan, H., Zhou, E., Li, Z., Xu, Y., Ji, Y., Chen, X., Chi, C., Wang, P., Jia, H., Ao, Y., Cao, M., Chen, S., Li, Z., Liu, M., Wang, Z., Rong, S., Lyu, Y., Zhao, Z., Co, P., Li, Y., Han, Y., Xie, S., Yao, G., Wang, S., Zhang, L., Yang, X., Jiao, Y., Shi, D., Xie, K., Nie, S., Men, C., Lin, Y., Wang, Z., Huang, T., Zhang, S.: Robobrain 2.5: Depth in sight, t...

  54. [54]

    Team, G.R., Abeyruwan, S., Ainslie, J., Alayrac, J.B., Arenas, M.G., Armstrong, T., Balakrishna, A., Baruch, R., Bauza, M., Blokzijl, M., et al.: Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020 (2025) 1, 4

  55. [55]

    In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=Vi8AepAXGy 2, 3, 4, 6, 24, 32

    Tong, S., II, E.L.B., Wu, P., Woo, S., IYER, A.J., Akula, S.C., Yang, S., Yang, J., Middepogu, M., Wang, Z., Pan, X., Fergus, R., LeCun, Y., Xie, S.: Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=Vi8AepAXGy 2, ...

  56. [56]

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023) 32

  57. [57]

    Wang, X., Ma, W., Zhang, T., de Melo, C.M., Chen, J., Yuille, A.: Spatial457: A diagnostic benchmark for 6d spatial reasoning of large mutimodal models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24669–24679 (2025) 4

  58. [58]

    Wang, Y., Shi, F.: Logical forms complement probability in understanding language model (and human) performance. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 16862–16877 (2025) 8

  59. [59]

    Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: Scannet++: A high-fidelity dataset of 3d indoor scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12–22 (2023) 22

  60. [60]

    Zhang, J., Chen, Y., Zhou, Y., Xu, Y., Huang, Z., Mei, J., Chen, J., Yuan, Y.J., Cai, X., Huang, G., et al.: From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv preprint arXiv:2503.22976 (2025) 5, 6, 22, 24

  61. [61]

    Zhang, W., Huang, Y., Xu, Y., Huang, J., Zhi, H., Ren, S., Xu, W., Zhang, J.: Why do mllms struggle with spatial understanding? a systematic analysis from data to architecture (2025), https://arxiv.org/abs/2509.02359 3, 4

  62. [62]

    Zhang, Z., Hu, F., Lee, J., Shi, F., Kordjamshidi, P., Chai, J., Ma, Z.: Do vision-language models represent space and how? evaluating spatial frame of reference under ambiguities. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=84pDoCD4lH 3, 8

  63. [63]

    Zheng, J., Zhang, J., Li, J., Tang, R., Gao, S., Zhou, Z.: Structured3d: A large photo-realistic dataset for structured 3d modeling. In: European Conference on Computer Vision. pp. 519–535. Springer (2020) 22

  64. [64]

    Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ade20k dataset. International journal of computer vision 127(3), 302–321 (2019) 24

  65. [65]

    Zhou, E., An, J., Chi, C., Han, Y., Rong, S., Zhang, C., Wang, P., Wang, Z., Huang, T., Sheng, L., Zhang, S.: Roborefer: Towards spatial referring with reasoning in vision-language models for robotics. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=OGxalNUHbJ 2, 4, 5, 6, 7, 9, 11, 13,...