Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

Cheng-Hao Kuo; Luca Carlone; Martin Labrie; Muyuan Lin; Roberto Martin-Martin; Shreekant Gayaka; Yurou Yang

arxiv: 2605.24642 · v1 · pith:C73Z2YA6 · submitted 2026-05-23 · 💻 cs.CV · cs.RO

Yurou Yang, Muyuan Lin, Roberto Martin-Martin, Martin Labrie, Shreekant Gayaka, Cheng-Hao Kuo, Luca Carlone

Pith reviewed 2026-06-30 13:38 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords vision-language-action models, geometric foundation models, linear probing, geometric understanding, 3D reconstruction, architecture comparison, robot learning

The pith

Current VLAs show a geometric gap relative to GFMs: linear probing can quantify that gap, and specific architectures for injecting GFM features into the VLA can help bridge it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether vision-language-action models already contain enough geometric understanding for 3D-aware tasks. It formalizes the geometric gap using linear probing between a VLA and a GFM, then compares three architectures that differ in how geometric features are injected while holding other factors fixed. The analysis also measures how training data size, camera count, and reconstruction quality change the resulting performance. A reader would care because better geometric grounding could improve spatial reasoning in robotic manipulation without requiring entirely new model families. The work treats these questions as open empirical issues rather than settled assumptions.

Core claim

The paper establishes that VLAs exhibit a quantifiable geometric gap relative to GFMs, that three distinct injection architectures produce different performance effects when geometry is added to a VLA, and that non-architectural factors including training data, camera number, and reconstruction quality also influence the success of the resulting geometric VLAs.

What carries the argument

Linear probing of VLA features to measure the geometric gap, together with three injection architectures that differ in the stage and manner of geometry feature addition from the GFM.
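
As a rough illustration of what a linear probe measures, the sketch below fits a ridge-regularised linear map from frozen features to a scalar depth target and compares held-out error for two feature sources. Everything here is an assumption made for illustration: the function names, the synthetic stand-ins for "VLA" and "GFM" features, and the choice to define the gap as a difference in mean absolute error. The probe targets, layers, and metric actually used in the paper are not stated in this summary.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_linear_probe(feats, targets, ridge=1e-3):
    """Fit ridge weights (bias included and also regularised) on frozen features."""
    X = [f + [1.0] for f in feats]  # append a constant bias feature
    d = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(x[i] * t for x, t in zip(X, targets)) for i in range(d)]
    return solve(A, b)

def probe_error(weights, feats, targets):
    """Mean absolute error of the probe's predictions."""
    preds = [sum(w * v for w, v in zip(weights, f + [1.0])) for f in feats]
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

def geometric_gap(vla_feats, gfm_feats, depth, n_train):
    """Hypothetical gap: held-out probe error on VLA features minus GFM features."""
    errs = []
    for feats in (vla_feats, gfm_feats):
        w = fit_linear_probe(feats[:n_train], depth[:n_train])
        errs.append(probe_error(w, feats[n_train:], depth[n_train:]))
    return errs[0] - errs[1], errs

# Synthetic toy data, NOT from the paper: "GFM" features encode depth almost
# linearly, while "VLA" features carry only a weak depth signal.
rng = random.Random(0)
depth = [rng.uniform(0.5, 3.0) for _ in range(200)]
gfm = [[d + rng.gauss(0, 0.05), 2 * d + rng.gauss(0, 0.05)] for d in depth]
vla = [[0.2 * d + rng.gauss(0, 1.0), rng.gauss(0, 1.0)] for d in depth]

gap, (vla_err, gfm_err) = geometric_gap(vla, gfm, depth, n_train=150)
print(f"VLA probe MAE={vla_err:.3f}  GFM probe MAE={gfm_err:.3f}  gap={gap:.3f}")
```

A positive gap in this toy setup only says that depth is more linearly readable from one feature set than the other. It does not show whether a policy actually uses that information when acting.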

If this is right

Linear probing can now be used to quantify the geometric gap between any VLA and GFM pair.
Architecture choice for geometry injection is not interchangeable and produces measurable differences in final VLA performance.
Increasing training data, camera count, or reconstruction quality each provides an independent lever for improving geometric VLAs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same probing method could be applied to future VLAs to decide whether geometry injection is still needed.
Results may generalize to other 3D-aware robotics benchmarks that rely on spatial consistency.
Designers with limited compute could use the reported relative effect sizes to decide whether to prioritize reconstruction quality or camera count.

Load-bearing premise

That linear probing performed on GR00T-N1.5 and VGGT supplies a representative measure of geometric understanding that extends to other VLAs, GFMs, and downstream tasks.

What would settle it

A replication that applies the same linear probing protocol to additional VLAs and GFMs and finds either no consistent gap or that all three injection architectures yield statistically identical performance would undermine the central claims.
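
One way to check whether two architectures differ on the same evaluation episodes is a paired test on per-episode success. The sketch below implements a two-sided exact McNemar test using only the standard library. The function name and the boolean-list input format are assumptions for illustration, and this summary does not say which statistical test the paper itself applies.

```python
from math import comb

def mcnemar_exact(success_a, success_b):
    """Two-sided exact McNemar test on paired binary outcomes.

    success_a[i] and success_b[i] are the outcomes of policies A and B
    on the same episode i. Only discordant pairs carry information.
    """
    b = sum(1 for x, y in zip(success_a, success_b) if x and not y)
    c = sum(1 for x, y in zip(success_a, success_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair favours A or B with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy outcomes, not results from the paper.
policy_a = [True] * 10
policy_b = [False] * 10
p_value = mcnemar_exact(policy_a, policy_b)
print(p_value)
```

A large p-value here means the test failed to detect a difference, which is weaker than showing the architectures are equivalent. A replication aiming at "statistically identical" would also need enough episodes, or an explicit equivalence margin.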

Figures

Figures reproduced from arXiv: 2605.24642 by Cheng-Hao Kuo, Luca Carlone, Martin Labrie, Muyuan Lin, Roberto Martin-Martin, Shreekant Gayaka, Yurou Yang.

**Figure 2.** Sample linear probing results: (a) RGB input, (b) ground truth (GT)

**Figure 3.** (a) VGGT depth error vs. manipulation success rate in RoboCasa. (b)

Original abstract

Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) if modern VLAs already have sufficient geometric understanding to start with, (ii) what is the best architecture to inject geometric understanding into a VLA, and (iii) what is the effect of other design choices that affect geometric VLAs. In this paper we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work's intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the "geometric gap" between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry in the VLA, while keeping low-level implementation details as similar as possible, to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Controlled experiments on GR00T-N1.5 plus VGGT quantify a gap and compare three injection points, but the broader claim about current VLAs rests on a single model pair.

The paper runs linear probing on GR00T-N1.5 against VGGT to measure a geometric gap, then compares three architectures for injecting the GFM features while holding other details fixed. It also ablates training data size, camera count, and reconstruction quality.

The controlled architecture comparison is the clearest contribution. Keeping low-level details similar across the three variants lets the reader see which injection strategy moves performance and by how much. The ablations on data and cameras add practical numbers that people tuning these systems can use directly.
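
To make "injection point" concrete, here is a deliberately tiny sketch of two places where geometry features could enter a policy: before the backbone (early) or after it (late). This is an illustration under stated assumptions, not the paper's designs. The function names are hypothetical, `mean_pool` is a stand-in for a real transformer backbone, and this summary does not specify what the paper's three architectures are.

```python
from typing import List

Token = List[float]

def mean_pool(tokens: List[Token]) -> Token:
    """Stand-in for a backbone: average the tokens element-wise."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def early_fusion(visual: List[Token], geometry: List[Token]) -> Token:
    """Inject before the backbone: geometry tokens join the visual sequence.

    Assumes geometry tokens were already projected to the visual token width.
    """
    return mean_pool(visual + geometry)

def late_fusion(visual: List[Token], geometry: List[Token]) -> Token:
    """Inject after the backbone: pooled geometry is concatenated to its output."""
    return mean_pool(visual) + mean_pool(geometry)

# Toy tokens, chosen only to make the shapes visible.
visual = [[1.0, 1.0], [3.0, 3.0]]
geometry = [[5.0, 5.0]]
early = early_fusion(visual, geometry)
late = late_fusion(visual, geometry)
print(len(early), len(late))
```

In this toy version, early fusion keeps the representation width and lets the backbone mix the two sources, while late fusion widens the input to whatever head consumes it. A controlled comparison of the kind the paper describes would hold everything else fixed while changing only this choice.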

The soft spot is scope. The abstract frames the work as addressing questions about current VLAs in general, yet all probing and comparisons stay inside GR00T-N1.5 and VGGT. No other VLAs are tested, so it is not shown that the measured gap or the ranking of architectures would hold for OpenVLA or RT-X variants. Linear probing itself only checks what a linear head can read out; it does not test whether the VLA actually uses that geometry when acting.

This is for researchers already working on geometric VLAs who need data on fusion choices for these models. It is not a general result or a new method.

I would send it to peer review. The experimental design is concrete enough that referees can evaluate the numbers and the limits of the claims.

Referee Report

1 major / 2 minor

Summary. The paper claims that current VLAs lack sufficient geometric understanding, formalized via linear probing that quantifies a 'geometric gap' between a VLA (GR00T-N1.5) and a GFM (VGGT). It compares three architectures for injecting geometry from GFMs into VLAs under controlled conditions and ablates the effects of training data, camera count, and reconstruction quality on downstream performance.

Significance. If the results hold, the work supplies the first quantitative linear-probing measure of the geometric gap and a controlled comparison of injection strategies, which could guide future geometric VLA design. The emphasis on keeping low-level implementation details similar across architectures is a methodological strength that supports fair comparisons.

major comments (1)

[Abstract, §3] Abstract and §3 (Linear Probing Analysis): The central claim that 'current VLAs lack geometric understanding' rests on linear probing performed exclusively with GR00T-N1.5 and VGGT. No other VLAs (e.g., OpenVLA or RT-X variants) are probed, and no argument is supplied that GR00T-N1.5's feature geometry is representative of the broader class. This single-model scope directly limits whether the reported 'geometric gap' generalizes, making the headline claim load-bearing on an untested assumption.

minor comments (2)

[§4] §4 (Architecture Variants): A diagram or table explicitly listing the feature-injection points and parameter counts for the three architectures would improve clarity of the controlled comparison.
[Figures] Figure captions throughout: Several figures lack explicit mention of the exact layers or tokens used for linear probing, which would aid reproducibility of the gap quantification.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the concern regarding the scope of the linear probing analysis below, noting that the manuscript already frames the study around a specific VLA and GFM while providing a controlled comparison of injection strategies.

Referee: [Abstract, §3] Abstract and §3 (Linear Probing Analysis): The central claim that 'current VLAs lack geometric understanding' rests on linear probing performed exclusively with GR00T-N1.5 and VGGT. No other VLAs (e.g., OpenVLA or RT-X variants) are probed, and no argument is supplied that GR00T-N1.5's feature geometry is representative of the broader class. This single-model scope directly limits whether the reported 'geometric gap' generalizes, making the headline claim load-bearing on an untested assumption.

Authors: We thank the referee for highlighting this scope limitation. The manuscript is explicit from the abstract onward that the linear probing, architectural comparisons, and ablations are performed for the specific VLA GR00T-N1.5 and GFM VGGT. GR00T-N1.5 is a recent high-capacity transformer-based VLA, and the linear probing protocol is designed to be model-agnostic so that the quantified gap serves as a case study. However, we agree that no explicit argument for representativeness is supplied. In the revised manuscript we will (i) add a short discussion in §3 on shared architectural traits (vision-language backbone, tokenization) with other VLAs such as OpenVLA, and (ii) qualify the abstract claim to refer to “modern VLAs such as GR00T-N1.5.” We view this as a partial revision because we do not introduce new probing experiments on additional VLAs, which would require substantial additional compute outside the current controlled setting.

revision: partial

Circularity Check

0 steps flagged

No circularity: purely experimental comparisons with no derivations or fitted predictions

The paper conducts linear probing to quantify a geometric gap, compares three injection architectures, and runs ablations on data/camera/reconstruction factors. All steps are empirical measurements on GR00T-N1.5 + VGGT; no equations, no parameter fitting presented as prediction, and no self-citation chains invoked to justify uniqueness or ansatzes. The central claim rests on experimental outcomes rather than any reduction to inputs by construction. Generalization concerns (one VLA/GFM pair) affect external validity but do not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work relies on standard experimental practices in machine learning.


[1] [1]

arXiv preprint arXiv:2509.14117 (2025)

Abouzeid, A., Mansour, M., Sun, Z., Song, D.: GeoAware-VLA: Implicit geometry aware vision-language-action model. arXiv preprint arXiv:2509.14117 (2025)

work page arXiv 2025

[2] [2]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

Bjorck, J., Blukis, V., Casta˜ neda, F., Cherniadev, N., Da, X., Ding, R., Fan, L.J., Fang, Y., Fox, D., et al.: GR00T-N1.5: An improved open foundation model for gen- eralist humanoid robots.https://research.nvidia.com/labs/gear/gr00t-n1_ 5/(2025)

2025

[4] [4]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Bjorck, J., Casta˜ neda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: GR00T-N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.:π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

In: Robotics: Science and Systems (RSS) (2025)

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.:π 0: A Vision-Language-Action flow model for general robot control. In: Robotics: Science and Systems (RSS) (2025)

2025

[7] [7]

RT-1: Robotics Transformer for Real-World Control at Scale

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakr- ishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[8] [8]

In: 2025 IEEE International Conference on Robotics and Automation (ICRA)

Cai, W., Ponomarenko, I., Yuan, J., Li, X., Yang, W., Dong, H., Zhao, B.: Spa- tialbot: Precise spatial understanding with vision language models. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 9490–9498. IEEE (2025)

2025

[9] [9]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR)

Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spa- tialVLM: Endowing vision-language models with spatial reasoning capabilities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR). pp. 14455–14465 (June 2024)

2024

[10] [10]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Chen, S., Garcia, R., Laptev, I., Schmid, C.: Sugar: Pre-training 3d visual repre- sentations for robotics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18049–18060 (2024)

2024

[11] Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. In: Robotics: Science and Systems (RSS) (2023)

[12] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

[13] Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: PaLM-E: an embodied multimodal language model. In: Proceedings of the 40th International Conference on Machine Learning. pp. 8469–8488 (2023)

[14] Fang, X., Gao, J., Wang, Z., Chen, Z., Ren, X., Lyu, J., Ren, Q., Yang, Z., Yang, X., Yan, Y., Lyu, C.: Dens3R: A foundation model for 3d geometry prediction. arXiv preprint arXiv:2507.16290 (2025)

[15] Figure AI: Helix: A vision-language-action model for generalist humanoid control (February 20, 2025), https://www.figure.ai/news/helix, accessed: 2026-01-21

[16] Ge, S., Zhang, Y., Xie, S., Zhang, W., Zhou, M., Wang, Z.: VGGT-DP: Generalizable robot control via vision foundation models. arXiv preprint arXiv:2509.18778 (2025)

[17] Goyal, A., Blukis, V., Xu, J., Guo, Y., Chao, Y.W., Fox, D.: RVT-2: Learning precise manipulation from few demonstrations. arXiv preprint arXiv:2406.08545 (2024)

[18] Han, B., Kim, J., Jang, J.: A dual process VLA: Efficient robotic manipulation leveraging VLM. arXiv preprint arXiv:2410.15549 (2024)

[19] Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

[20] Hu, W., Lin, J., Long, Y., Ran, Y., Jiang, L., Wang, Y., Zhu, C., Xu, R., Wang, T., Pang, J.: G²VLM: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2026)

[21] Hu, Y., Cheng, C., Yu, S., Guo, X., Wang, H.: VGGT4D: Mining motion cues in visual geometry transformers for 4d scene reconstruction. arXiv preprint arXiv:2511.19971 (2025)

[22] Hu, Y., Guo, Y., Wang, P., Chen, X., Wang, Y.J., Zhang, J., Sreenath, K., Lu, C., Chen, J.: Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803 (2024)

[23] Jia, Y., Liu, J., Chen, S., Gu, C., Wang, Z., Luo, L., Lee, L., Wang, P., Wang, Z., Zhang, R., et al.: Lift3d foundation policy: Lifting 2d large-scale pre-trained models for robust 3d robotic manipulation. arXiv preprint arXiv:2411.18623 (2024)

[24] Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., Luiten, J., Lopez-Antequera, M., Rota Bulò, S., Richardt, C., Ramanan, D., Scherer, S., Kontschieder, P.: MapAnything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414 (2025)

[25] Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., Finn, C.: OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

[26] Kim, T., Lee, J., Koo, M., Kim, D., Lee, K., Kim, C., Seo, Y., Shin, J.: Contrastive representation regularization for vision-language-action models. arXiv preprint arXiv:2510.01711 (2025)

[27] Leroy, V., Cabon, Y., Revaud, J.: Grounding image matching in 3d with MASt3R. In: European Conf. on Computer Vision (ECCV). vol. 15130, pp. 71–91 (2024)

[28] Li, C., Wen, J., Peng, Y., Peng, Y., Feng, F., Zhu, Y.: Pointvla: Injecting the 3d world into vision-language-action models. arXiv preprint arXiv:2503.07511 (2025)

[29] Li, F., Song, W., Zhao, H., Wang, J., Ding, P., Wang, D., Zeng, L., Li, H.: Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276 (2025)

[30] Lin, T., Li, G., Zhong, Y., Zou, Y., Du, Y., Liu, J., Gu, E., Zhao, B.: Evo-0: Vision-language-action model with implicit spatial understanding. arXiv preprint arXiv:2507.00416 (2025)

[31] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: LIBERO: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310 (2023)

[32] McNemar, Q.: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2), 153–157 (1947)

[33] Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., Zhu, Y.: Robocasa: Large-scale simulation of everyday tasks for generalist robots. In: Robotics: Science and Systems (RSS) (2024)

[34] Ni, Z., He, Y., Qian, L., Mao, J., Fu, F., Sui, W., Su, H., Peng, J., Wang, Z., He, B.: Vo-dp: Semantic-geometric adaptive diffusion policy for vision-only robotic manipulation. arXiv preprint arXiv:2510.15530 (2025)

[35] Peng, H., Li, H., Dai, Y., Lan, Y., Luo, Y., Qi, T., Zhang, Z., Zhan, Y., Zhang, J., Xu, W., Liu, Z.: Omnivggt: Omni-modality driven visual geometry grounded transformer. arXiv preprint arXiv:2511.10560 (2025)

[36] Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)

[37] Physical Intelligence: Physical intelligence (π) (2026), https://www.pi.website/, accessed: 2026-01-21

[38] Qu, D., Song, H., Chen, Q., Yao, Y., Ye, X., Ding, Y., Wang, Z., Gu, J., Zhao, B., Wang, D., et al.: SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830 (2025)

[39] Rojas, S., Armando, M., Ghanem, B., Weinzaepfel, P., Leroy, V., Rogez, G.: HAMSt3R: Human-aware multi-view stereo 3d reconstruction. arXiv preprint arXiv:2508.16433 (2025)

[40] Snyder, D., Hancock, A.J., Badithela, A., Dixon, E., Miller, P., Ambrus, R.A., Majumdar, A., Itkina, M., Nishimura, H.: Is your imitation learning policy better than mine? Policy comparison with near-optimal stopping (2025), https://arxiv.org/abs/2503.10966

[41] Steiner, R., Millane, A., Tingdahl, D., Volk, C., Ramasamy, V., Yao, X., Du, P., Pouya, S., Sheng, S.: mindmap: Spatial memory in deep feature maps for 3d action policies. arXiv preprint arXiv:2509.20297 (2025)

[42] Tesla, Inc.: AI & robotics (2026), https://www.tesla.com/en_eu/AI, accessed: 2026-01-21

[43] TRI LBM Team, Barreiros, J., Beaulieu, A., Bhat, A., Cory, R., Cousineau, E., Dai, H., Fang, C.H., Hashimoto, K., Irshad, M.Z., Itkina, M., Kuppuswamy, N., Lee, K.H., Liu, K., McConachie, D., McMahon, I., Nishimura, H., Phillips-Grafflin, C., Richter, C., Shah, P., Srinivasan, K., Wulfe, B., Xu, C., Zhang, M., Alspach, A., Angeles, M., Arora, K., Guizilin...

[44] Wang, H., Agapito, L.: AMB3R: Accurate feed-forward metric-scale 3d reconstruction with backend. arXiv preprint arXiv:2511.20343 (2025)

[45] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual geometry grounded transformer. arXiv preprint arXiv:2503.11651 (2025)

[46] Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: DUSt3R: Geometric 3d vision made easy. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 20697–20709 (2024)

[47] Wofk, D., Ma, F., Yang, T.J., Karaman, S., Sze, V.: FastDepth: Fast Monocular Depth Estimation on Embedded Systems. In: IEEE International Conference on Robotics and Automation (ICRA) (2019)

[48] Wu, D., Liu, F., Hung, Y.H., Duan, Y.: Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. In: Advances in Neural Information Processing Systems (NeurIPS) (2025)

[49] Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth Anything V2. arXiv preprint arXiv:2406.09414 (2024)

[50] Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., Xu, H.: 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954 (2024)

[51] Zhang, Z., Li, H., Dai, Y., Zhu, Z., Zhou, L., Liu, C., Wang, D., Tay, F.E.H., Chen, S., Liu, Z., Liu, Y., Li, X., Zhou, P.: From spatial to actions: Grounding vision-language-action model in spatial foundation priors. arXiv preprint arXiv:2510.17439 (2025)

[52] Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., Gan, C.: 3D-VLA: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631 (2024)

[53] Zheng, D., Huang, S., Li, Y., Wang, L.: Learning from videos for 3D world: Enhancing MLLMs with 3D vision geometry priors. In: Advances in Neural Information Processing Systems (NeurIPS) (2025)

[54] Zhu, C., Wang, T., Zhang, W., Pang, J., Liu, X.: LLaVA-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. arXiv preprint arXiv:2409.18125 (2024)

[55] Zust, L., Cabon, Y., Marrie, J., Antsfeld, L., Chidlovskii, B., Revaud, J., Csurka, G.: PanSt3R: Multi-view consistent panoptic segmentation. arXiv preprint arXiv:2506.21348 (2025)

A Appendix: Details about Architectures

This appendix complements the discussion in Section 3, where we introduced the Early and Late Fusion models, and discusse...