pith. machine review for the scientific record

arxiv: 2604.12908 · v1 · submitted 2026-04-14 · 💻 cs.RO

Recognition: unknown

Robotic Manipulation is Vision-to-Geometry Mapping (f(v) → G): Vision-Geometry Backbones over Language and Video Models

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic manipulation · vision-geometry mapping · 3D world models · VGA model · zero-shot generalization · physical actions · manipulation tasks

The pith

Robotic manipulation is a vision-to-geometry mapping problem best solved by pretrained 3D world models rather than language or video backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that physical actions depend on exact 3D positions and spatial relations, so generalizable robot control needs backbones that start from native 3D geometry instead of semantic language data or 2D video sequences. Current vision-language and video-predictive models rely on representations shaped by 2D priors or concepts that do not match the geometric demands of manipulation. The authors introduce the Vision-Geometry-Action model that substitutes a pretrained 3D world model for those backbones, adds a Progressive Volumetric Modulation module for consistency, and trains jointly to produce actions directly from visual inputs. Experiments show this yields higher precision in simulation than leading VLA baselines and stronger zero-shot performance on real robots facing new viewpoints.

Core claim

At its core, robotic manipulation is a problem of vision-to-geometry mapping. The Vision-Geometry-Action model replaces conventional language or video backbones with a pretrained native 3D world model to translate visual inputs directly into physical actions, supported by Progressive Volumetric Modulation and joint training for geometric consistency.
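Written out, the claim is a two-stage pipeline: a mapping f from visual observations v to a geometric state G, followed by an action policy over G (the policy symbol below is ours, added only for illustration). The second equation is a reconstruction of the Eq. (10) fragment describing PVM's dual-stage modulation, in which layer-wise decoder features are fused with the action stream and projected back to the latent space; treat it as a sketch, not the authors' exact formulation.

    % Vision-to-geometry mapping, then an action policy over the geometric state
    f : v \;\rightarrow\; G, \qquad \pi_{\text{act}} : G \;\rightarrow\; a
    % PVM dual-stage modulation (reconstructed from the Eq. (10) fragment):
    % decoder features at layer l+1 are concatenated with the layer-l action
    % stream and projected back to the latent space, interleaved across layers
    \mathbf{h}^{(l+1)}_{\mathrm{dec}} \;=\; \mathrm{Linear}\!\big(\big[\,\mathbf{h}'^{(l+1)}_{\mathrm{dec}},\; \mathbf{a}^{(l)}_{\mathrm{dec}}\,\big]\big)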

What carries the argument

Vision-Geometry-Action (VGA) model that conditions action generation on pretrained 3D world model representations to create a direct vision-to-geometry mapping.
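As a rough illustration of that conditioning pathway, the sketch below wires a generic pretrained 3D world-model encoder into a small action head whose layers fold the running action estimate back into the decoder features, loosely echoing the PVM update above. Every class name, dimension, and design detail here is a hypothetical stand-in, not the authors' implementation; in the paper the backbone would be an actual pretrained 3D world model trained jointly with the head.

    # Hypothetical sketch (not the authors' code): a pretrained 3D world model
    # encodes the observation into geometric latents; a small action head decodes
    # them into an end-effector action, folding the running action estimate back
    # into the decoder features at every layer, loosely mirroring PVM.
    import torch
    import torch.nn as nn

    class ActionHead(nn.Module):
        def __init__(self, latent_dim: int, action_dim: int = 7, num_layers: int = 4):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Linear(latent_dim, latent_dim) for _ in range(num_layers))
            # Per-layer fusion: concatenate decoder features with the action stream
            # and project back to the latent space (cf. h_dec^(l+1) = Linear([h', a])).
            self.fuse = nn.ModuleList(
                nn.Linear(latent_dim + action_dim, latent_dim) for _ in range(num_layers))
            self.to_action = nn.Linear(latent_dim, action_dim)

        def forward(self, geo_latent: torch.Tensor) -> torch.Tensor:
            h = geo_latent
            act = torch.zeros(h.shape[0], self.to_action.out_features, device=h.device)
            for layer, fuse in zip(self.layers, self.fuse):
                h_prime = torch.relu(layer(h))               # decoder update
                h = fuse(torch.cat([h_prime, act], dim=-1))  # fold action stream back in
                act = self.to_action(h)                      # refreshed action estimate
            return act                                       # e.g. delta pose + gripper

    class VGASketch(nn.Module):
        def __init__(self, world_model_3d: nn.Module, latent_dim: int):
            super().__init__()
            self.backbone = world_model_3d   # pretrained 3D world model (placeholder)
            self.head = ActionHead(latent_dim)

        def forward(self, rgb: torch.Tensor) -> torch.Tensor:
            geo_latent = self.backbone(rgb)  # native geometric representation of the scene
            return self.head(geo_latent)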

If this is right

  • VGA outperforms top-tier VLA baselines including π0.5 and GeoVLA on simulation benchmarks for precise manipulation.
  • VGA achieves stronger zero-shot generalization to unseen viewpoints in real-world robot deployments than π0.5.
  • Direct operation on native 3D representations, rather than translation through language or 2D priors, supports more generalizable physical intelligence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotics foundation models may shift priority toward 3D geometry sources over multimodal language training for spatial tasks.
  • The same 3D backbone approach could apply to navigation or assembly problems that also hinge on precise spatial relations.
  • Reduced dependence on language intermediaries might simplify training pipelines when only geometric accuracy matters.

Load-bearing premise

Native 3D geometric representations align better with the spatial requirements of physical actions than representations shaped by language semantics or 2D video priors.

What would settle it

A controlled test in which a vision-language model achieves equal or higher success rates than VGA on precise manipulation tasks under novel real-world viewpoints would disprove the claimed superiority.
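A hedged sketch of what such a controlled test could look like in practice: per-trial success outcomes for VGA and a vision-language baseline collected under the same unseen viewpoints, compared with a two-proportion z-test. The task names echo the real-world tasks mentioned in Figure 5, but the trial counts and outcomes are placeholders, not results reported by the paper.

    # Illustrative comparison of two policies under novel viewpoints; every
    # number below is a placeholder, not a result from the paper.
    from math import sqrt
    from statistics import NormalDist

    def two_proportion_p(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
        """Two-sided p-value for H0: both policies have the same success rate."""
        p_a, p_b = success_a / n_a, success_b / n_b
        pooled = (success_a + success_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        if se == 0:
            return 1.0
        z = (p_a - p_b) / se
        return 2 * (1 - NormalDist().cdf(abs(z)))

    # Hypothetical tallies: 20 trials per task per policy under an unseen camera.
    trials = {
        "press_button": {"vga": (17, 20), "vla_baseline": (11, 20)},
        "stack_cube":   {"vga": (14, 20), "vla_baseline": (9, 20)},
    }

    for task, res in trials.items():
        (sa, na), (sb, nb) = res["vga"], res["vla_baseline"]
        p = two_proportion_p(sa, na, sb, nb)
        print(f"{task}: VGA {sa}/{na} vs. baseline {sb}/{nb}, two-sided p = {p:.3f}")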

Figures

Figures reproduced from arXiv: 2604.12908 by Guangrun Wang, Jiawei Zhou, Liang Lin, Qichang Li, Tianshui Chen, Zhenlong Yuan, Zijian Song.

Figure 1
Figure 1: Robotic manipulation as vision-to-geometry mapping (f(v) → G). Physical actions like reaching, grasping, and orienting are inherently driven by geometric properties, such as 3D position, rotation, and spatial relationships. Therefore, we argue that a vision-geometry backbone provides a stronger foundation for generalizable robotic control than prevalent vision-language or video models.
Figure 2
Figure 2: Overview of our VGA model. (a) The left column compares our VGA framework with representative robot learning paradigms. VGA differs from them by leveraging a pretrained 3D world model as the backbone, providing native 3D representations aligned with physical actions. (b) The right column illustrates the workflow of the VGA model. Multimodal inputs are tokenized into a unified sequence and processed by a p…
Figure 3
Figure 3: Simulation rollouts and depth predictions on the LIBERO benchmark. The results show that VGA achieves precise physical manipulation with precise depth predictions.
Figure 4
Figure 4: Real-world configuration. The platform is equipped with one wrist camera and two fixed cameras. The fixed cameras are used for in-distribution and out-of-distribution evaluation, respectively.
Figure 5
Figure 5: Visualization of real-world manipulation. Our method demonstrates coherent and stable manipulation behaviors under both seen and unseen viewpoints, highlighting its strong generalization.
Figure 6
Figure 6: Language-grounded grasping under varying layouts. This figure presents the results of real-world grasping with three visually similar objects arranged in different layouts. Each row corresponds to a different spatial configuration, and the robot is instructed to pick a target object. VGA consistently identifies and grasps the correct object regardless of its position, demonstrating robust language grounding.
Figure 7
Figure 7: Impact of joint training and LoRA rank. Part (a) presents the ablation study on LoRA rank. The results plateau around rank 64. Part (b) compares the convergence speed between models trained with and without joint training, evaluated using checkpoints from 1K to 5K steps on LIBERO-Spatial. Joint training leads to faster convergence and improved early-stage performance, indicating better data efficiency.
Figure 10
Figure 10: Initial state variations for evaluation tasks. This figure illustrates the three evaluation tasks under four types of initial condition variations. Despite these diverse and challenging variations, our VGA consistently achieves high manipulation success rates across all settings.
Original abstract

At its core, robotic manipulation is a problem of vision-to-geometry mapping ($f(v) \rightarrow G$). Physical actions are fundamentally defined by geometric properties like 3D positions and spatial relationships. Consequently, we argue that the foundation for generalizable robotic control should be a vision-geometry backbone, rather than the widely adopted vision-language or video models. Conventional VLA and video-predictive models rely on backbones pretrained on large-scale 2D image-text or temporal pixel data. While effective, their representations are largely shaped by semantic concepts or 2D priors, which do not intrinsically align with the precise 3D geometric nature required for physical manipulation. Driven by this insight, we propose the Vision-Geometry-Action (VGA) model, which directly conditions action generation on pretrained native 3D representations. Specifically, VGA replaces conventional language or video backbones with a pretrained 3D world model, establishing a seamless vision-to-geometry mapping that translates visual inputs directly into physical actions. To further enhance geometric consistency, we introduce a Progressive Volumetric Modulation module and adopt a joint training strategy. Extensive experiments validate the effectiveness of our approach. In simulation benchmarks, VGA outperforms top-tier VLA baselines including $\pi_{0.5}$ and GeoVLA, demonstrating its superiority in precise manipulation. More importantly, VGA exhibits remarkable zero-shot generalization to unseen viewpoints in real-world deployments, consistently outperforming $\pi_{0.5}$. These results highlight that operating on native 3D representations-rather than translating through language or 2D video priors-is a highly promising direction for achieving generalizable physical intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that robotic manipulation is fundamentally a vision-to-geometry mapping problem (f(v) → G) and that native 3D representations should replace vision-language or video backbones for generalizable control. It introduces the Vision-Geometry-Action (VGA) model, which conditions actions on a pretrained 3D world model, adds a Progressive Volumetric Modulation module, and uses joint training; the abstract claims this yields superior performance on simulation benchmarks versus π0.5 and GeoVLA plus zero-shot real-world generalization to unseen viewpoints.

Significance. If the central claim holds and the performance edge is attributable to native 3D geometry rather than ancillary design choices, the work could shift robotic foundation models toward explicit 3D world models, improving precision and viewpoint invariance in manipulation tasks. The absence of ablations, error bars, and dataset details in the reported results, however, leaves the magnitude of this potential contribution uncertain.

major comments (3)
  1. [Abstract] Abstract: the claim that VGA 'outperforms top-tier VLA baselines including π0.5 and GeoVLA' and exhibits 'remarkable zero-shot generalization' cannot be evaluated because no error bars, dataset sizes, number of trials, or statistical tests are provided; without these, it is impossible to determine whether the reported gains are robust or driven by the 3D backbone itself.
  2. [Abstract] Abstract and experimental description: the manuscript does not report ablations that isolate the contribution of the pretrained 3D world model from the Progressive Volumetric Modulation module, the joint training strategy, or differences in the action head and data pipeline. Consequently the central premise—that native 3D representations are the load-bearing factor for the observed gains over π0.5 and GeoVLA—remains untested.
  3. [Abstract] Abstract: the zero-shot viewpoint generalization result is presented as evidence for the superiority of 3D geometry, yet no quantitative metrics (e.g., success rate deltas, viewpoint ranges, or failure modes) or comparison conditions (e.g., 2D backbone with viewpoint augmentation) are supplied, weakening the link between the architectural choice and the claimed generalization.
minor comments (2)
  1. [Abstract] The abstract introduces the notation f(v) → G without defining the symbol G or the precise geometric output space; a short clarification of the target representation would improve readability.
  2. [Abstract] The phrase 'seamless vision-to-geometry mapping' is used repeatedly; a concrete description of how the 3D world model outputs are aligned with the action space (e.g., via specific layers or loss terms) would help readers assess the mapping.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the concerns identify gaps in the current presentation, we have revised the manuscript to incorporate additional details, experiments, and clarifications.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that VGA 'outperforms top-tier VLA baselines including π0.5 and GeoVLA' and exhibits 'remarkable zero-shot generalization' cannot be evaluated because no error bars, dataset sizes, number of trials, or statistical tests are provided; without these, it is impossible to determine whether the reported gains are robust or driven by the 3D backbone itself.

    Authors: We agree that the absence of these statistical details in the abstract limits the ability to assess robustness. In the revised manuscript we have expanded the abstract and the experimental results section to report error bars across all benchmarks, the exact number of trials and dataset sizes used, and the results of statistical significance tests. These additions make it possible to evaluate whether the observed improvements are reliable and attributable to the architectural choices. revision: yes

  2. Referee: [Abstract] Abstract and experimental description: the manuscript does not report ablations that isolate the contribution of the pretrained 3D world model from the Progressive Volumetric Modulation module, the joint training strategy, or differences in the action head and data pipeline. Consequently the central premise—that native 3D representations are the load-bearing factor for the observed gains over π0.5 and GeoVLA—remains untested.

    Authors: We acknowledge that isolating the contribution of the pretrained 3D world model is essential to support the central claim. We have added a dedicated ablation study in the revised experimental section that compares the full VGA model against controlled variants: one without the pretrained 3D backbone (replaced by a 2D or language backbone), one without Progressive Volumetric Modulation, and one using separate rather than joint training. The new results show that removing the native 3D representations produces the largest performance drop relative to the VLA baselines, while the other components provide complementary gains (a sketch of this variant grid follows these responses). revision: yes

  3. Referee: [Abstract] Abstract: the zero-shot viewpoint generalization result is presented as evidence for the superiority of 3D geometry, yet no quantitative metrics (e.g., success rate deltas, viewpoint ranges, or failure modes) or comparison conditions (e.g., 2D backbone with viewpoint augmentation) are supplied, weakening the link between the architectural choice and the claimed generalization.

    Authors: We agree that stronger quantitative support is needed to link the 3D geometry backbone to viewpoint-invariant generalization. In the revision we have augmented the real-world evaluation section with success-rate deltas across multiple unseen viewpoint ranges, a breakdown of failure modes, and a direct comparison against a 2D backbone baseline trained with explicit viewpoint augmentation. These additions provide concrete evidence that the native 3D representations confer advantages beyond what 2D augmentation alone can achieve. revision: yes
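To make the promised ablation concrete, here is a minimal sketch of how the controlled variants described in response 2 could be enumerated; the factor names and levels are assumptions inferred from the components named in the abstract, not the authors' actual experiment configuration.

    # Hypothetical enumeration of the ablation variants described in the
    # rebuttal; each toggles one factor relative to the full VGA configuration.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Variant:
        name: str
        backbone: str        # "3d_world_model", "2d_vision", or "language"
        pvm: bool            # Progressive Volumetric Modulation enabled?
        joint_training: bool

    VARIANTS = [
        Variant("full_vga",          "3d_world_model", pvm=True,  joint_training=True),
        Variant("no_3d_backbone",    "2d_vision",      pvm=True,  joint_training=True),
        Variant("language_backbone", "language",       pvm=True,  joint_training=True),
        Variant("no_pvm",            "3d_world_model", pvm=False, joint_training=True),
        Variant("separate_training", "3d_world_model", pvm=True,  joint_training=False),
    ]

    for v in VARIANTS:
        print(f"{v.name}: backbone={v.backbone}, pvm={v.pvm}, joint={v.joint_training}")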

Circularity Check

0 steps flagged

No circularity: conceptual premise with independent experimental validation

full rationale

The paper advances a first-principles-style argument that manipulation reduces to vision-to-geometry mapping and therefore favors native 3D backbones over VLA or video models. This premise is stated directly in the abstract and introduction without deriving it from fitted parameters, self-referential equations, or prior self-citations that themselves depend on the target result. VGA is introduced as a concrete architecture (pretrained 3D world model plus Progressive Volumetric Modulation and joint training) whose performance is then measured against baselines. No step equates a 'prediction' to its own training inputs by construction, nor does any load-bearing uniqueness claim collapse to an unverified self-citation. The derivation chain therefore remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper rests on the domain assumption that 3D geometric representations are intrinsically better aligned with manipulation than semantic or 2D priors; the VGA model and Progressive Volumetric Modulation module are new constructs introduced without independent evidence outside the reported experiments.

axioms (1)
  • domain assumption Pretrained 3D world models supply native geometric representations that align directly with physical manipulation requirements.
    Invoked to justify replacing language and video backbones; appears in the motivation and method description.
invented entities (2)
  • Vision-Geometry-Action (VGA) model no independent evidence
    purpose: Direct conditioning of action generation on pretrained 3D representations.
    New architecture proposed to implement the vision-to-geometry mapping.
  • Progressive Volumetric Modulation module no independent evidence
    purpose: Enhance geometric consistency in the 3D representation pipeline.
    Introduced as an additional component to improve the core mapping.

pith-pipeline@v0.9.0 · 5629 in / 1405 out tokens · 36217 ms · 2026-05-10T15:17:04.847698+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

83 extracted references · 55 canonical work pages · 27 internal anchors

  1. [1]

    Ali Abouzeid, Malak Mansour, Zezhou Sun, and Dezhen Song. 2025. GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model. arXiv preprint arXiv:2509.14117 (2025)

  2. [2]

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. 2025. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030(2025)

  3. [4]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky

  4. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164 [cs.LG] https://arxiv.org/abs/2410.24164

  5. [6]

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al . 2025. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669(2025)

  6. [7]

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. 2025. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111(2025)

  7. [8]

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging Properties in Self-Supervised Vision Transformers. InProceedings of the International Conference on Computer Vision (ICCV)

  8. [9]

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al . 2025. WorldVLA: Towards Autoregressive Action World Model.arXiv preprint arXiv:2506.21539(2025)

  9. [10]

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14455–14465

  10. [11]

    Guangyan Chen, Meiling Wang, Te Cui, Luojie Yang, Qi Shao, Lin Zhao, Tianle Zhang, Yihang Li, Yi Yang, and Yufeng Yue. 2025. Unifying Latent Action and Latent State Pre-training for Policy Learning from Videos. InProceedings of the SIGGRAPH Asia 2025 Conference Papers. 1–11

  11. [12]

    Hongyu Chen, Liang Lin, and Guangrun Wang. 2026. OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling. arXiv:2604.09580 [cs.AI] https://arxiv.org/abs/2604.09580

  12. [13]

    Shizhe Chen, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. 2024. Sugar: Pre-training 3d visual representations for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18049–18060

  13. [14]

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. 2025. Villa-X: enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682 (2025)

  14. [15]

    Yuhao Chen, Zhihao Zhan, Xiaoxin Lin, Zijian Song, Hao Liu, Qinhan Lyu, Yubo Zu, Xiao Chen, Zhiyuan Liu, Tao Pu, et al. 2026. RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation. arXiv preprint arXiv:2602.10980 (2026)

  15. [16]

    Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Wenhao Zhang, et al . 2025. Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data. arXiv preprint arXiv:2505.03233(2025)

  16. [17]

    David Ha and Jürgen Schmidhuber. 2018. World models. arXiv preprint arXiv:1803.10122 2, 3 (2018), 440

  17. [18]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. 2022. LoRA: Low-rank adaptation of large language models. ICLR 1, 2 (2022), 3

  18. [19]

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. 2024. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803(2024)

  19. [20]

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. 2025. Thinkact: Vision-language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815(2025)

  20. [21]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

  21. [22]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization. arXiv:2504.16054 [cs.LG] https://arxiv.org/abs/2504.16054

  22. [23]

    Yueru Jia, Jiaming Liu, Sixiang Chen, Chenyang Gu, Zhilue Wang, Longzan Luo, Lily Lee, Pengwei Wang, Zhongyuan Wang, Renrui Zhang, et al. 2024. Lift3d foundation policy: Lifting 2d large-scale pretrained models for robust 3d robotic manipulation.arXiv preprint arXiv:2411.18623(2024)

  23. [24]

    Moo Jin Kim, Chelsea Finn, and Percy Liang. 2025. Fine-tuning vision-language- action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645 (2025)

  24. [25]

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. 2026. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163(2026)

  25. [26]

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al

  26. [27]

    Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246(2024)

  27. [28]

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al . 2025. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917 (2025)

  28. [29]

    Chengmeng Li, Junjie Wen, Yaxin Peng, Yan Peng, and Yichen Zhu. 2026. Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters11, 3 (2026), 2506–2513

  29. [30]

    Haoyuan Li, Yanpeng Zhou, Yufei Gao, Tao Tang, Jianhua Han, Yujie Yuan, Dave Zhenyu Chen, Jiawang Bian, Hang Xu, and Xiaodan Liang. 2025. Does your 3d encoder really work? when pretrain-sft from 2d vlms meets 3d vlms.arXiv preprint arXiv:2506.05318(2025)

  30. [31]

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. 2026. Causal World Modeling for Robot Control.arXiv preprint arXiv:2601.21998(2026)

  31. [32]

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. 2025. Unified video action model.arXiv preprint arXiv:2503.00200(2025)

  32. [33]

    Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, and Guangrun Wang. 2025. VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling.arXiv preprint arXiv:2512.02902(2025)

  33. [34]

    Xiaoqi Li, Liang Heng, Jiaming Liu, Yan Shen, Chenyang Gu, Zhuoyang Liu, Hao Chen, Nuowei Han, Renrui Zhang, Hao Tang, et al . 2025. 3ds-vla: A 3d spatial-aware vision language action model for robust multi-task manipulation. In9th Annual Conference on Robot Learning

  34. [35]

    Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning.arXiv preprint arXiv:2308.03281(2023)

  35. [36]

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. 2025. Video generators are robot policies.arXiv preprint arXiv:2508.00795(2025)

  36. [37]

    Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, and Zhijie Deng. 2025. Improved visual-spatial reasoning via r1-zero-like training.arXiv preprint arXiv:2504.00883(2025)

  37. [38]

    Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, and Bo Zhao. 2025. Evo-0: Vision-language-action model with implicit spatial understanding.arXiv preprint arXiv:2507.00416(2025)

  38. [39]

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. 2023. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems36 (2023), 44776–44791

  39. [40]

    Disheng Liu, Tuo Liang, Zhe Hu, Jierui Peng, Yiren Lu, Yi Xu, Yun Fu, and Yu Yin

  40. [41]

    Deconstructing Spatial Intelligence in Vision-Language Models.Authorea Preprints(2025)

  41. [42]

    Disheng Liu, Tuo Liang, Zhe Hu, Jierui Peng, Yiren Lu, Yi Xu, Yun Fu, and Yu Yin. 2026. Spatial Intelligence in Vision-Language Models: A Comprehensive Survey. (2026)

  42. [43]

    Haoyun Liu, Jianzhuang Zhao, Xinyuan Chang, Tianle Shi, Chuanzhang Meng, Jiayuan Tan, Feng Xiong, Tong Lin, Dongjie Huo, Mu Xu, et al . 2026. Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models.arXiv preprint arXiv:2603.01766(2026)

  43. [44]

    Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al . 2025. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model.arXiv preprint arXiv:2503.10631(2025)

  44. [45]

    Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, and Yanwei Fu. 2026. ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation.arXiv preprint arXiv:2601.08325(2026)

  45. [46]

    Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. 2025. SpatialLM: Training Large Language Models for Structured Indoor Modeling. arXiv:2506.07491 [cs.CV] https://arxiv.org/abs/ 2506.07491

  46. [47]

    J Bjorck Nvidia, Fernando Castaneda, N Cherniadev, X Da, R Ding, L Fan, Y Fang, D Fox, F Hu, S Huang, et al. 2025. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 2 (2025)

  47. [48]

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. 2025. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692(2025)

  48. [49]

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. 2025. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830 (2025)

  49. [50]

    Zhifeng Rao, Wenlong Chen, Lei Xie, Xia Hua, Dongfu Yin, Zhen Tian, and F Richard Yu. 2026. AugVLA-3D: Depth-Driven Feature Augmentation for Vision- Language-Action Models.arXiv preprint arXiv:2602.10698(2026)

  50. [51]

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. 2021. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction

  51. [52]

    Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. 2025. VideoVLA: Video Generators Can Be Generalizable Robot Manipulators.Advances in neural information processing systems(2025)

  52. [53]

    Zijian Song, Qichang Li, Sihan Qin, Yuhao Chen, Tianshui Chen, Liang Lin, and Guangrun Wang. 2026. Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation.arXiv preprint arXiv:2603.00110(2026)

  53. [54]

    Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, and Guangrun Wang. 2025. Physical autoregressive model for robotic manipulation without action pretraining. arXiv preprint arXiv:2508.09822 (2025)

  54. [55]

    Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, and Jiale Cao. 2025. Geovla: Empowering 3d representations in vision-language-action models.arXiv preprint arXiv:2508.09071(2025)

  55. [56]

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. 2024. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213(2024)

  56. [57]

    An Dinh Vuong, Minh Nhat Vu, and Ian Reid. 2025. Improving Robotic Manipulation with Efficient Geometry-Aware Vision Encoder. arXiv preprint arXiv:2509.15880 (2025)

  57. [58]

    Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian, Yogesh S Rawat, Yunhao Ge, and Yuzhang Shang. 2026. VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning. arXiv preprint arXiv:2603.14523 (2026)

  58. [59]

    Guangcong Wang, Zhaoxi Chen, Chen Change Loy, and Ziwei Liu. 2023. SparseNeRF: Distilling depth ranking for few-shot novel view synthesis. In Proceedings of the IEEE/CVF international conference on computer vision. 9065–9076

  59. [60]

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference. 5294–5306

  60. [61]

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. 2025. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model.arXiv preprint arXiv:2509.09372(2025)

  61. [62]

    Zuoxu Wang, Zhijie Yan, Shufei Li, and Jihong Liu. 2025. IndVisSGG: VLM-based scene graph generation for industrial spatial intelligence.Advanced Engineering Informatics65 (2025), 103107

  62. [63]

    Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng

  63. [64]

    Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855(2025)

  64. [65]

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. 2025. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters (2025)

  65. [66]

    Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. 2024. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 12156–12163

  66. [67]

    Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, et al . 2025. A0: An affordance-aware hierarchical model for general robotic manipulation.arXiv preprint arXiv:2504.12636(2025)

  67. [68]

    Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, et al. 2026. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning.arXiv preprint arXiv:2602.11236(2026)

  68. [69]

    Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. 2020. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1790–1799

  69. [70]

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al

  70. [71]

    World action models are zero-shot policies.arXiv preprint arXiv:2602.15922 (2026)

  71. [72]

    Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, and Hang Zhao. 2025. Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning.arXiv preprint arXiv:2510.13375(2025)

  72. [73]

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 2024. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations.arXiv preprint arXiv:2403.03954(2024)

  73. [74]

    Zhihao Zhan, Yuhao Chen, Jiaying Zhou, Qinhan Lv, Hao Liu, Keze Wang, Liang Lin, and Guangrun Wang. 2026. Stable Language Guidance for Vision-Language-Action Models. arXiv preprint arXiv:2601.04052 (2026)

  74. [75]

    Zhihao Zhan, Jiaying Zhou, Likui Zhang, Qinhan Lv, Hao Liu, Jusheng Zhang, Weizheng Li, Ziliang Chen, Tianshui Chen, Ruifeng Zhai, Keze Wang, Liang Lin, and Guangrun Wang. 2026. E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Tweedie Discrete Diffusion. arXiv:2511.21542 [cs.RO] https://arxiv.org/abs/2511.21542

  75. [76]

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al . 2025. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447(2025)

  76. [77]

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al . 2025. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference. 1702–1713

  77. [78]

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. 2023. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705(2023)

  78. [79]

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. 2024. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345(2024)

  79. [80]

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404(2024)

  80. [81]

    Jiaying Zhou, Zhihao Zhan, Ruifeng Zhai, Qinhan Lyu, Hao Liu, Keze Wang, Liang Lin, and Guangrun Wang. 2026. TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models.arXiv preprint arXiv:2603.24584(2026)

Showing first 80 references.