arxiv: 2605.24642 · v1 · pith:C73Z2YA6new · submitted 2026-05-23 · 💻 cs.CV · cs.RO

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

Pith reviewed 2026-06-30 13:38 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords vision-language-action models · geometric foundation models · linear probing · geometric understanding · 3D reconstruction · architecture comparison · robot learning

The pith

For one VLA and GFM pair (GR00T-N1.5 and VGGT), linear probing quantifies a geometric gap in the VLA, and the architecture used to inject GFM features affects how well that gap is bridged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether vision-language-action models already contain enough geometric understanding for 3D-aware tasks. It formalizes the geometric gap using linear probing between a VLA and a GFM, then compares three architectures that differ in how geometric features are injected while holding other factors fixed. The analysis also measures how training data size, camera count, and reconstruction quality change the resulting performance. A reader would care because better geometric grounding could improve spatial reasoning in robotic manipulation without requiring entirely new model families. The work treats these questions as open empirical issues rather than settled assumptions.

Core claim

For GR00T-N1.5 and VGGT, the paper reports that the VLA exhibits a quantifiable geometric gap relative to the GFM, that three distinct injection architectures produce different performance effects when geometry is added to the VLA, and that non-architectural factors including training data, camera number, and reconstruction quality also influence the performance of the resulting geometric VLAs.

What carries the argument

Linear probing of VLA features to measure the geometric gap, together with three injection architectures that differ in the stage and manner of geometry feature addition from the GFM.
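To make the probing idea concrete, here is a minimal sketch of a linear probe, assuming frozen per-sample features and a scalar depth target. It is not the paper's protocol: the function names (`fit_linear_probe`, `probe_error`, `geometric_gap`), the ridge penalty, the mean-absolute-error metric, and the synthetic features are all illustrative assumptions. The "gap" here is simply the held-out probe error on VLA-like features minus the held-out probe error on GFM-like features.

```python
import numpy as np

def _with_bias(features):
    # Append a constant column so the probe can learn an offset.
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_linear_probe(features, targets, l2=1e-3):
    """Closed-form ridge regression from frozen features to targets.
    The bias weight is regularized too, which keeps the sketch short."""
    X = _with_bias(features)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)

def probe_error(weights, features, targets):
    """Mean absolute error of the probe's predictions."""
    return float(np.mean(np.abs(_with_bias(features) @ weights - targets)))

def geometric_gap(vla_feats, gfm_feats, depth, n_train):
    """Held-out probe error on VLA features minus that on GFM features."""
    errs = []
    for feats in (vla_feats, gfm_feats):
        w = fit_linear_probe(feats[:n_train], depth[:n_train])
        errs.append(probe_error(w, feats[n_train:], depth[n_train:]))
    return errs[0] - errs[1], errs[0], errs[1]

# Synthetic stand-ins (assumption): the "GFM" features linearly encode depth,
# while the "VLA" features are unrelated to it.
rng = np.random.default_rng(0)
n, d = 2000, 32
gfm_feats = rng.normal(size=(n, d))
depth = gfm_feats @ rng.normal(size=d) + 0.05 * rng.normal(size=n)
vla_feats = rng.normal(size=(n, d))

gap, vla_err, gfm_err = geometric_gap(vla_feats, gfm_feats, depth, n_train=1500)
print(f"VLA probe error={vla_err:.3f}, GFM probe error={gfm_err:.3f}, gap={gap:.3f}")
```

Because the probe is linear and the features stay frozen, a large positive gap indicates that the target is not linearly decodable from the first feature set, which is the sense in which probing measures what a representation already contains.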

If this is right

  • Linear probing can now be used to quantify the geometric gap between any VLA and GFM pair.
  • Architecture choice for geometry injection is not interchangeable and produces measurable differences in final VLA performance.
  • Training data, camera count, and reconstruction quality each affect geometric VLA performance, so each is a candidate lever for improvement.
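The point that injection architectures are not interchangeable can be illustrated with a toy sketch of two generic ways to add geometry tokens to a token-based policy. These are not the paper's three architectures, whose details are not given on this page. The names `early_injection` and `late_injection`, the random linear adapter, and all tensor sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(tokens, weight):
    """Linear adapter mapping GFM token width to VLA token width."""
    return tokens @ weight

def early_injection(vla_tokens, gfm_tokens, weight):
    """Append projected geometry tokens to the token sequence
    before it enters the backbone (sequence gets longer)."""
    return np.concatenate([vla_tokens, project(gfm_tokens, weight)], axis=0)

def late_injection(backbone_out, gfm_tokens, weight):
    """Add one pooled, projected geometry vector to every backbone
    output token (sequence length is unchanged)."""
    pooled = project(gfm_tokens, weight).mean(axis=0, keepdims=True)
    return backbone_out + pooled

# Hypothetical sizes: 16 VLA tokens of width 64, 8 GFM tokens of width 128.
n_vla, n_gfm, d_vla, d_gfm = 16, 8, 64, 128
vla_tokens = rng.normal(size=(n_vla, d_vla))
gfm_tokens = rng.normal(size=(n_gfm, d_gfm))
adapter = rng.normal(size=(d_gfm, d_vla)) / np.sqrt(d_gfm)

early = early_injection(vla_tokens, gfm_tokens, adapter)
# For brevity the raw VLA tokens stand in for the backbone output here.
late = late_injection(vla_tokens, gfm_tokens, adapter)
print(early.shape, late.shape)  # (24, 64) (16, 64)
```

The two variants differ in sequence length and in where the geometry signal can interact with language and image tokens, which is one reason a controlled comparison needs the remaining implementation details held fixed.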

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same probing method could be applied to future VLAs to decide whether geometry injection is still needed.
  • Results may generalize to other 3D-aware robotics benchmarks that rely on spatial consistency.
  • Designers could prioritize reconstruction quality over camera count when compute is limited, based on the relative effect sizes reported.

Load-bearing premise

That linear probing performed on GR00T-N1.5 and VGGT supplies a representative measure of geometric understanding that extends to other VLAs, GFMs, and downstream tasks.

What would settle it

A replication that applies the same linear probing protocol to additional VLAs and GFMs and finds either no consistent gap or that all three injection architectures yield statistically identical performance would undermine the central claims.
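One way a replication could check whether two injection architectures differ on the same evaluation episodes is a paired test on success/failure outcomes. The sketch below implements the exact (binomial) form of McNemar's test using only the standard library. The function name `mcnemar_exact` and the episode outcomes are made up for illustration, and nothing here is taken from the paper's evaluation code.

```python
from math import comb

def mcnemar_exact(success_a, success_b):
    """Two-sided exact McNemar test on paired binary outcomes.

    Returns (b, c, p_value), where b counts episodes only policy A solved
    and c counts episodes only policy B solved.
    """
    if len(success_a) != len(success_b):
        raise ValueError("outcome lists must be paired episode by episode")
    b = sum(1 for x, y in zip(success_a, success_b) if x and not y)
    c = sum(1 for x, y in zip(success_a, success_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Made-up outcomes for 50 paired episodes (assumption, not reported data).
policy_a = [True] * 10 + [False] * 2 + [True] * 30 + [False] * 8
policy_b = [False] * 10 + [True] * 2 + [True] * 30 + [False] * 8

b, c, p = mcnemar_exact(policy_a, policy_b)
print(b, c, f"{p:.4f}")  # 10 2 0.0386
```

Note that a large p-value only means the test failed to detect a difference. Showing that architectures perform identically would need an equivalence-style analysis with a stated margin.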

Figures

Figures reproduced from arXiv: 2605.24642 by Cheng-Hao Kuo, Luca Carlone, Martin Labrie, Muyuan Lin, Roberto Martin-Martin, Shreekant Gayaka, Yurou Yang.

Figure 1: (a) Standard VLA architecture. (b)-(d) Key strategies to inject tokens ...
Figure 2: Sample linear probing results: (a) RGB input, (b) ground truth (GT) ...
Figure 3: (a) VGGT depth error vs. manipulation success rate in RoboCasa. (b) ...
Original abstract

Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) if modern VLAs already have sufficient geometric understanding to start with, (ii) what is the best architecture to inject geometric understanding into a VLA, and (iii) what is the effect of other design choices that affect geometric VLAs. In this paper we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work's intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the "geometric gap" between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry in the VLA, while keeping low-level implementation details as similar as possible, to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that current VLAs lack sufficient geometric understanding, formalized via linear probing that quantifies a 'geometric gap' between a VLA (GR00T-N1.5) and a GFM (VGGT). It compares three architectures for injecting geometry from GFMs into VLAs under controlled conditions and ablates the effects of training data, camera count, and reconstruction quality on downstream performance.

Significance. If the results hold, the work supplies the first quantitative linear-probing measure of the geometric gap and a controlled comparison of injection strategies, which could guide future geometric VLA design. The emphasis on keeping low-level implementation details similar across architectures is a methodological strength that supports fair comparisons.

major comments (1)
  1. [Abstract, §3] Abstract and §3 (Linear Probing Analysis): The central claim that 'current VLAs lack geometric understanding' rests on linear probing performed exclusively with GR00T-N1.5 and VGGT. No other VLAs (e.g., OpenVLA or RT-X variants) are probed, and no argument is supplied that GR00T-N1.5's feature geometry is representative of the broader class. This single-model scope directly limits whether the reported 'geometric gap' generalizes, making the headline claim load-bearing on an untested assumption.
minor comments (2)
  1. [§4] §4 (Architecture Variants): A diagram or table explicitly listing the feature-injection points and parameter counts for the three architectures would improve clarity of the controlled comparison.
  2. [Figures] Figure captions throughout: Several figures lack explicit mention of the exact layers or tokens used for linear probing, which would aid reproducibility of the gap quantification.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the concern regarding the scope of the linear probing analysis below, noting that the manuscript already frames the study around a specific VLA and GFM while providing a controlled comparison of injection strategies.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (Linear Probing Analysis): The central claim that 'current VLAs lack geometric understanding' rests on linear probing performed exclusively with GR00T-N1.5 and VGGT. No other VLAs (e.g., OpenVLA or RT-X variants) are probed, and no argument is supplied that GR00T-N1.5's feature geometry is representative of the broader class. This single-model scope directly limits whether the reported 'geometric gap' generalizes, making the headline claim load-bearing on an untested assumption.

    Authors: We thank the referee for highlighting this scope limitation. The manuscript is explicit from the abstract onward that the linear probing, architectural comparisons, and ablations are performed for the specific VLA GR00T-N1.5 and GFM VGGT. GR00T-N1.5 is a recent high-capacity transformer-based VLA, and the linear probing protocol is designed to be model-agnostic so that the quantified gap serves as a case study. However, we agree that no explicit argument for representativeness is supplied. In the revised manuscript we will (i) add a short discussion in §3 on shared architectural traits (vision-language backbone, tokenization) with other VLAs such as OpenVLA, and (ii) qualify the abstract claim to refer to “modern VLAs such as GR00T-N1.5.” We view this as a partial revision because we do not introduce new probing experiments on additional VLAs, which would require substantial additional compute outside the current controlled setting. revision: partial

Circularity Check

0 steps flagged

No circularity: purely experimental comparisons with no derivations or fitted predictions

Full rationale

The paper conducts linear probing to quantify a geometric gap, compares three injection architectures, and runs ablations on data/camera/reconstruction factors. All steps are empirical measurements on GR00T-N1.5 + VGGT; no equations, no parameter fitting presented as prediction, and no self-citation chains invoked to justify uniqueness or ansatzes. The central claim rests on experimental outcomes rather than any reduction to inputs by construction. Generalization concerns (one VLA/GFM pair) affect external validity but do not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work relies on standard experimental practices in machine learning.

pith-pipeline@v0.9.1-grok · 5804 in / 1109 out tokens · 34145 ms · 2026-06-30T13:38:16.119790+00:00 · methodology

