pith. machine review for the scientific record

arxiv: 2604.12908 · v1 · submitted 2026-04-14 · 💻 cs.RO

Recognition: unknown

Robotic Manipulation is Vision-to-Geometry Mapping (f(v) → G): Vision-Geometry Backbones over Language and Video Models

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic manipulation · vision-geometry mapping · 3D world models · VGA model · zero-shot generalization · physical actions · manipulation tasks

The pith

Robotic manipulation is a vision-to-geometry mapping problem best solved by pretrained 3D world models rather than language or video backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that physical actions depend on exact 3D positions and spatial relations, so generalizable robot control needs backbones that start from native 3D geometry instead of semantic language data or 2D video sequences. Current vision-language and video-predictive models rely on representations shaped by 2D priors or concepts that do not match the geometric demands of manipulation. The authors introduce the Vision-Geometry-Action model that substitutes a pretrained 3D world model for those backbones, adds a Progressive Volumetric Modulation module for consistency, and trains jointly to produce actions directly from visual inputs. Experiments show this yields higher precision in simulation than leading VLA baselines and stronger zero-shot performance on real robots facing new viewpoints.

Core claim

At its core, robotic manipulation is a problem of vision-to-geometry mapping. The Vision-Geometry-Action model replaces conventional language or video backbones with a pretrained native 3D world model to translate visual inputs directly into physical actions, supported by Progressive Volumetric Modulation and joint training for geometric consistency.
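Written out, the claim is a two-stage pipeline: a mapping f from visual observations v to a geometric state G, followed by an action policy over G (the policy symbol below is ours, added only for illustration). The second equation is a reconstruction of the Eq. (10) fragment describing PVM's dual-stage modulation, in which layer-wise decoder features are fused with the action stream and projected back to the latent space; treat it as a sketch, not the authors' exact formulation.

    % Vision-to-geometry mapping, then an action policy over the geometric state
    f : v \;\rightarrow\; G, \qquad \pi_{\text{act}} : G \;\rightarrow\; a
    % PVM dual-stage modulation (reconstructed from the Eq. (10) fragment):
    % decoder features at layer l+1 are concatenated with the layer-l action
    % stream and projected back to the latent space, interleaved across layers
    \mathbf{h}^{(l+1)}_{\mathrm{dec}} \;=\; \mathrm{Linear}\!\big(\big[\,\mathbf{h}'^{(l+1)}_{\mathrm{dec}},\; \mathbf{a}^{(l)}_{\mathrm{dec}}\,\big]\big)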

What carries the argument

Vision-Geometry-Action (VGA) model that conditions action generation on pretrained 3D world model representations to create a direct vision-to-geometry mapping.
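As a rough illustration of that conditioning pathway, the sketch below wires a generic pretrained 3D world-model encoder into a small action head whose layers fold the running action estimate back into the decoder features, loosely echoing the PVM update above. Every class name, dimension, and design detail here is a hypothetical stand-in, not the authors' implementation; in the paper the backbone would be an actual pretrained 3D world model trained jointly with the head.

    # Hypothetical sketch (not the authors' code): a pretrained 3D world model
    # encodes the observation into geometric latents; a small action head decodes
    # them into an end-effector action, folding the running action estimate back
    # into the decoder features at every layer, loosely mirroring PVM.
    import torch
    import torch.nn as nn

    class ActionHead(nn.Module):
        def __init__(self, latent_dim: int, action_dim: int = 7, num_layers: int = 4):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Linear(latent_dim, latent_dim) for _ in range(num_layers))
            # Per-layer fusion: concatenate decoder features with the action stream
            # and project back to the latent space (cf. h_dec^(l+1) = Linear([h', a])).
            self.fuse = nn.ModuleList(
                nn.Linear(latent_dim + action_dim, latent_dim) for _ in range(num_layers))
            self.to_action = nn.Linear(latent_dim, action_dim)

        def forward(self, geo_latent: torch.Tensor) -> torch.Tensor:
            h = geo_latent
            act = torch.zeros(h.shape[0], self.to_action.out_features, device=h.device)
            for layer, fuse in zip(self.layers, self.fuse):
                h_prime = torch.relu(layer(h))               # decoder update
                h = fuse(torch.cat([h_prime, act], dim=-1))  # fold action stream back in
                act = self.to_action(h)                      # refreshed action estimate
            return act                                       # e.g. delta pose + gripper

    class VGASketch(nn.Module):
        def __init__(self, world_model_3d: nn.Module, latent_dim: int):
            super().__init__()
            self.backbone = world_model_3d   # pretrained 3D world model (placeholder)
            self.head = ActionHead(latent_dim)

        def forward(self, rgb: torch.Tensor) -> torch.Tensor:
            geo_latent = self.backbone(rgb)  # native geometric representation of the scene
            return self.head(geo_latent)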

If this is right

  • VGA outperforms top-tier VLA baselines including π0.5 and GeoVLA on simulation benchmarks for precise manipulation.
  • VGA achieves stronger zero-shot generalization to unseen viewpoints in real-world robot deployments than π0.5.
  • Direct operation on native 3D representations, rather than translation through language or 2D priors, supports more generalizable physical intelligence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotics foundation models may shift priority toward 3D geometry sources over multimodal language training for spatial tasks.
  • The same 3D backbone approach could apply to navigation or assembly problems that also hinge on precise spatial relations.
  • Reduced dependence on language intermediaries might simplify training pipelines when only geometric accuracy matters.

Load-bearing premise

Native 3D geometric representations align better with the spatial requirements of physical actions than representations shaped by language semantics or 2D video priors.

What would settle it

A controlled test in which a vision-language model achieves equal or higher success rates than VGA on precise manipulation tasks under novel real-world viewpoints would disprove the claimed superiority.
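A hedged sketch of what such a controlled test could look like in practice: per-trial success outcomes for VGA and a vision-language baseline collected under the same unseen viewpoints, compared with a two-proportion z-test. The task names echo the real-world tasks mentioned in Figure 5, but the trial counts and outcomes are placeholders, not results reported by the paper.

    # Illustrative comparison of two policies under novel viewpoints; every
    # number below is a placeholder, not a result from the paper.
    from math import sqrt
    from statistics import NormalDist

    def two_proportion_p(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
        """Two-sided p-value for H0: both policies have the same success rate."""
        p_a, p_b = success_a / n_a, success_b / n_b
        pooled = (success_a + success_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        if se == 0:
            return 1.0
        z = (p_a - p_b) / se
        return 2 * (1 - NormalDist().cdf(abs(z)))

    # Hypothetical tallies: 20 trials per task per policy under an unseen camera.
    trials = {
        "press_button": {"vga": (17, 20), "vla_baseline": (11, 20)},
        "stack_cube":   {"vga": (14, 20), "vla_baseline": (9, 20)},
    }

    for task, res in trials.items():
        (sa, na), (sb, nb) = res["vga"], res["vla_baseline"]
        p = two_proportion_p(sa, na, sb, nb)
        print(f"{task}: VGA {sa}/{na} vs. baseline {sb}/{nb}, two-sided p = {p:.3f}")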

Figures

Figures reproduced from arXiv: 2604.12908 by Guangrun Wang, Jiawei Zhou, Liang Lin, Qichang Li, Tianshui Chen, Zhenlong Yuan, Zijian Song.

Figure 1
Figure 1: Robotic manipulation as vision-to-geometry mapping (f(v) → G). Physical actions like reaching, grasping, and orienting are inherently driven by geometric properties, such as 3D position, rotation, and spatial relationships. Therefore, we argue that a vision-geometry backbone provides a stronger foundation for generalizable robotic control than prevalent vision-language or video models.
Figure 2
Figure 2: Overview of our VGA model. (a) The left column compares our VGA framework with representative robot learning paradigms. VGA differs from them by leveraging a pretrained 3D world model as the backbone, providing native 3D representations aligned with physical actions. (b) The right column illustrates the workflow of the VGA model. Multimodal inputs are tokenized into a unified sequence and processed by a p…
Figure 3
Figure 3: Simulation rollouts and depth predictions on the LIBERO benchmark. The results show that VGA achieves precise physical manipulation with precise depth predictions.
Figure 4
Figure 4: Real-world configuration. The platform is equipped with one wrist camera and two fixed cameras. The fixed cameras are used for in-distribution and out-of-distribution evaluation, respectively.
Figure 5
Figure 5: Visualization of real-world manipulation. Our method demonstrates coherent and stable manipulation behaviors under both seen and unseen viewpoints, highlighting its strong generalization.
Figure 6
Figure 6: Language-grounded grasping under varying layouts. This figure presents the results of real-world grasping with three visually similar objects arranged in different layouts. Each row corresponds to a different spatial configuration, and the robot is instructed to pick a target object. VGA consistently identifies and grasps the correct object regardless of its position, demonstrating robust language grounding.
Figure 7
Figure 7: Impact of joint training and LoRA rank. Part (a) presents the ablation study on LoRA rank. The results plateau around rank 64. Part (b) compares the convergence speed between models trained with and without joint training, evaluated using checkpoints from 1K to 5K steps on LIBERO-Spatial. Joint training leads to faster convergence and improved early-stage performance, indicating better data efficiency.
Figure 10
Figure 10: Initial state variations for evaluation tasks. This figure illustrates the three evaluation tasks under four types of initial condition variations. Despite these diverse and challenging variations, our VGA consistently achieves high manipulation success rates across all settings.
Original abstract

At its core, robotic manipulation is a problem of vision-to-geometry mapping ($f(v) \rightarrow G$). Physical actions are fundamentally defined by geometric properties like 3D positions and spatial relationships. Consequently, we argue that the foundation for generalizable robotic control should be a vision-geometry backbone, rather than the widely adopted vision-language or video models. Conventional VLA and video-predictive models rely on backbones pretrained on large-scale 2D image-text or temporal pixel data. While effective, their representations are largely shaped by semantic concepts or 2D priors, which do not intrinsically align with the precise 3D geometric nature required for physical manipulation. Driven by this insight, we propose the Vision-Geometry-Action (VGA) model, which directly conditions action generation on pretrained native 3D representations. Specifically, VGA replaces conventional language or video backbones with a pretrained 3D world model, establishing a seamless vision-to-geometry mapping that translates visual inputs directly into physical actions. To further enhance geometric consistency, we introduce a Progressive Volumetric Modulation module and adopt a joint training strategy. Extensive experiments validate the effectiveness of our approach. In simulation benchmarks, VGA outperforms top-tier VLA baselines including $\pi_{0.5}$ and GeoVLA, demonstrating its superiority in precise manipulation. More importantly, VGA exhibits remarkable zero-shot generalization to unseen viewpoints in real-world deployments, consistently outperforming $\pi_{0.5}$. These results highlight that operating on native 3D representations-rather than translating through language or 2D video priors-is a highly promising direction for achieving generalizable physical intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that robotic manipulation is fundamentally a vision-to-geometry mapping problem (f(v) → G) and that native 3D representations should replace vision-language or video backbones for generalizable control. It introduces the Vision-Geometry-Action (VGA) model, which conditions actions on a pretrained 3D world model, adds a Progressive Volumetric Modulation module, and uses joint training; the abstract claims this yields superior performance on simulation benchmarks versus π0.5 and GeoVLA plus zero-shot real-world generalization to unseen viewpoints.

Significance. If the central claim holds and the performance edge is attributable to native 3D geometry rather than ancillary design choices, the work could shift robotic foundation models toward explicit 3D world models, improving precision and viewpoint invariance in manipulation tasks. The absence of ablations, error bars, and dataset details in the reported results, however, leaves the magnitude of this potential contribution uncertain.

major comments (3)
  1. [Abstract] Abstract: the claim that VGA 'outperforms top-tier VLA baselines including π0.5 and GeoVLA' and exhibits 'remarkable zero-shot generalization' cannot be evaluated because no error bars, dataset sizes, number of trials, or statistical tests are provided; without these, it is impossible to determine whether the reported gains are robust or driven by the 3D backbone itself.
  2. [Abstract] Abstract and experimental description: the manuscript does not report ablations that isolate the contribution of the pretrained 3D world model from the Progressive Volumetric Modulation module, the joint training strategy, or differences in the action head and data pipeline. Consequently the central premise—that native 3D representations are the load-bearing factor for the observed gains over π0.5 and GeoVLA—remains untested.
  3. [Abstract] Abstract: the zero-shot viewpoint generalization result is presented as evidence for the superiority of 3D geometry, yet no quantitative metrics (e.g., success rate deltas, viewpoint ranges, or failure modes) or comparison conditions (e.g., 2D backbone with viewpoint augmentation) are supplied, weakening the link between the architectural choice and the claimed generalization.
minor comments (2)
  1. [Abstract] The abstract introduces the notation f(v) → G without defining the symbol G or the precise geometric output space; a short clarification of the target representation would improve readability.
  2. [Abstract] The phrase 'seamless vision-to-geometry mapping' is used repeatedly; a concrete description of how the 3D world model outputs are aligned with the action space (e.g., via specific layers or loss terms) would help readers assess the mapping.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the concerns identify gaps in the current presentation, we have revised the manuscript to incorporate additional details, experiments, and clarifications.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that VGA 'outperforms top-tier VLA baselines including π0.5 and GeoVLA' and exhibits 'remarkable zero-shot generalization' cannot be evaluated because no error bars, dataset sizes, number of trials, or statistical tests are provided; without these, it is impossible to determine whether the reported gains are robust or driven by the 3D backbone itself.

    Authors: We agree that the absence of these statistical details in the abstract limits the ability to assess robustness. In the revised manuscript we have expanded the abstract and the experimental results section to report error bars across all benchmarks, the exact number of trials and dataset sizes used, and the results of statistical significance tests. These additions make it possible to evaluate whether the observed improvements are reliable and attributable to the architectural choices. revision: yes

  2. Referee: [Abstract] Abstract and experimental description: the manuscript does not report ablations that isolate the contribution of the pretrained 3D world model from the Progressive Volumetric Modulation module, the joint training strategy, or differences in the action head and data pipeline. Consequently the central premise—that native 3D representations are the load-bearing factor for the observed gains over π0.5 and GeoVLA—remains untested.

    Authors: We acknowledge that isolating the contribution of the pretrained 3D world model is essential to support the central claim. We have added a dedicated ablation study in the revised experimental section that compares the full VGA model against controlled variants: one without the pretrained 3D backbone (replaced by a 2D or language backbone), one without Progressive Volumetric Modulation, and one using separate rather than joint training. The new results show that removing the native 3D representations produces the largest performance drop relative to the VLA baselines, while the other components provide complementary gains (a sketch of this variant grid follows these responses). revision: yes

  3. Referee: [Abstract] Abstract: the zero-shot viewpoint generalization result is presented as evidence for the superiority of 3D geometry, yet no quantitative metrics (e.g., success rate deltas, viewpoint ranges, or failure modes) or comparison conditions (e.g., 2D backbone with viewpoint augmentation) are supplied, weakening the link between the architectural choice and the claimed generalization.

    Authors: We agree that stronger quantitative support is needed to link the 3D geometry backbone to viewpoint-invariant generalization. In the revision we have augmented the real-world evaluation section with success-rate deltas across multiple unseen viewpoint ranges, a breakdown of failure modes, and a direct comparison against a 2D backbone baseline trained with explicit viewpoint augmentation. These additions provide concrete evidence that the native 3D representations confer advantages beyond what 2D augmentation alone can achieve. revision: yes
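To make the promised ablation concrete, here is a minimal sketch of how the controlled variants described in response 2 could be enumerated; the factor names and levels are assumptions inferred from the components named in the abstract, not the authors' actual experiment configuration.

    # Hypothetical enumeration of the ablation variants described in the
    # rebuttal; each toggles one factor relative to the full VGA configuration.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Variant:
        name: str
        backbone: str        # "3d_world_model", "2d_vision", or "language"
        pvm: bool            # Progressive Volumetric Modulation enabled?
        joint_training: bool

    VARIANTS = [
        Variant("full_vga",          "3d_world_model", pvm=True,  joint_training=True),
        Variant("no_3d_backbone",    "2d_vision",      pvm=True,  joint_training=True),
        Variant("language_backbone", "language",       pvm=True,  joint_training=True),
        Variant("no_pvm",            "3d_world_model", pvm=False, joint_training=True),
        Variant("separate_training", "3d_world_model", pvm=True,  joint_training=False),
    ]

    for v in VARIANTS:
        print(f"{v.name}: backbone={v.backbone}, pvm={v.pvm}, joint={v.joint_training}")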

Circularity Check

0 steps flagged

No circularity: conceptual premise with independent experimental validation

full rationale

The paper advances a first-principles-style argument that manipulation reduces to vision-to-geometry mapping and therefore favors native 3D backbones over VLA or video models. This premise is stated directly in the abstract and introduction without deriving it from fitted parameters, self-referential equations, or prior self-citations that themselves depend on the target result. VGA is introduced as a concrete architecture (pretrained 3D world model plus Progressive Volumetric Modulation and joint training) whose performance is then measured against baselines. No step equates a 'prediction' to its own training inputs by construction, nor does any load-bearing uniqueness claim collapse to an unverified self-citation. The derivation chain therefore remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper rests on the domain assumption that 3D geometric representations are intrinsically better aligned with manipulation than semantic or 2D priors; the VGA model and Progressive Volumetric Modulation module are new constructs introduced without independent evidence outside the reported experiments.

axioms (1)
  • domain assumption Pretrained 3D world models supply native geometric representations that align directly with physical manipulation requirements.
    Invoked to justify replacing language and video backbones; appears in the motivation and method description.
invented entities (2)
  • Vision-Geometry-Action (VGA) model no independent evidence
    purpose: Direct conditioning of action generation on pretrained 3D representations.
    New architecture proposed to implement the vision-to-geometry mapping.
  • Progressive Volumetric Modulation module no independent evidence
    purpose: Enhance geometric consistency in the 3D representation pipeline.
    Introduced as an additional component to improve the core mapping.

pith-pipeline@v0.9.0 · 5629 in / 1405 out tokens · 36217 ms · 2026-05-10T15:17:04.847698+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

83 extracted references · 55 canonical work pages · 27 internal anchors

  1. [1]

    Ali Abouzeid, Malak Mansour, Zezhou Sun, and Dezhen Song. 2025. GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model. arXiv preprint arXiv:2509.14117 (2025)

  2. [2]

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. 2025. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030(2025)

  3. [4]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky

  4. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164 [cs.LG] https://arxiv.org/abs/2410.24164

  5. [6]

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al . 2025. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669(2025)

  6. [7]

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. 2025. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111(2025)

  7. [8]

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging Properties in Self-Supervised Vision Transformers. InProceedings of the International Conference on Computer Vision (ICCV)

  8. [9]

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al . 2025. WorldVLA: Towards Autoregressive Action World Model.arXiv preprint arXiv:2506.21539(2025)

  9. [10]

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14455–14465

  10. [11]

    Guangyan Chen, Meiling Wang, Te Cui, Luojie Yang, Qi Shao, Lin Zhao, Tianle Zhang, Yihang Li, Yi Yang, and Yufeng Yue. 2025. Unifying Latent Action and Latent State Pre-training for Policy Learning from Videos. InProceedings of the SIGGRAPH Asia 2025 Conference Papers. 1–11

  11. [12]

    Hongyu Chen, Liang Lin, and Guangrun Wang. 2026. OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling. arXiv:2604.09580 [cs.AI] https://arxiv.org/abs/2604.09580

  12. [13]

    Shizhe Chen, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. 2024. Sugar: Pre-training 3d visual representations for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18049–18060

  13. [14]

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. 2025. Villa-X: enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682 (2025)

  14. [15]

    Yuhao Chen, Zhihao Zhan, Xiaoxin Lin, Zijian Song, Hao Liu, Qinhan Lyu, Yubo Zu, Xiao Chen, Zhiyuan Liu, Tao Pu, et al. 2026. RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation. arXiv preprint arXiv:2602.10980 (2026)

  15. [16]

    Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Wenhao Zhang, et al . 2025. Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data. arXiv preprint arXiv:2505.03233(2025)

  16. [17]

    David Ha and Jürgen Schmidhuber. 2018. World models. arXiv preprint arXiv:1803.10122 2, 3 (2018), 440

  17. [18]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. 2022. LoRA: Low-rank adaptation of large language models. ICLR 1, 2 (2022), 3

  18. [19]

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. 2024. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803(2024)

  19. [20]

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. 2025. Thinkact: Vision-language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815(2025)

  20. [21]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

  21. [22]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization. arXiv:2504.16054 [cs.LG] https://arxiv.org/abs/2504.16054

  22. [23]

    Yueru Jia, Jiaming Liu, Sixiang Chen, Chenyang Gu, Zhilue Wang, Longzan Luo, Lily Lee, Pengwei Wang, Zhongyuan Wang, Renrui Zhang, et al. 2024. Lift3d foundation policy: Lifting 2d large-scale pretrained models for robust 3d robotic manipulation.arXiv preprint arXiv:2411.18623(2024)

  23. [24]

    Moo Jin Kim, Chelsea Finn, and Percy Liang. 2025. Fine-tuning vision-language- action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645 (2025)

  24. [25]

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. 2026. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163(2026)

  25. [26]

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al

  26. [27]

    Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246(2024)

  27. [28]

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al . 2025. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917 (2025)

  28. [29]

    Chengmeng Li, Junjie Wen, Yaxin Peng, Yan Peng, and Yichen Zhu. 2026. Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters11, 3 (2026), 2506–2513

  29. [30]

    Haoyuan Li, Yanpeng Zhou, Yufei Gao, Tao Tang, Jianhua Han, Yujie Yuan, Dave Zhenyu Chen, Jiawang Bian, Hang Xu, and Xiaodan Liang. 2025. Does your 3d encoder really work? when pretrain-sft from 2d vlms meets 3d vlms.arXiv preprint arXiv:2506.05318(2025)

  30. [31]

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. 2026. Causal World Modeling for Robot Control.arXiv preprint arXiv:2601.21998(2026)

  31. [32]

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. 2025. Unified video action model.arXiv preprint arXiv:2503.00200(2025)

  32. [33]

    Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, and Guangrun Wang. 2025. VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling.arXiv preprint arXiv:2512.02902(2025)

  33. [34]

    Xiaoqi Li, Liang Heng, Jiaming Liu, Yan Shen, Chenyang Gu, Zhuoyang Liu, Hao Chen, Nuowei Han, Renrui Zhang, Hao Tang, et al . 2025. 3ds-vla: A 3d spatial-aware vision language action model for robust multi-task manipulation. In9th Annual Conference on Robot Learning

  34. [35]

    Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning.arXiv preprint arXiv:2308.03281(2023)

  35. [36]

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. 2025. Video generators are robot policies.arXiv preprint arXiv:2508.00795(2025)

  36. [37]

    Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, and Zhijie Deng. 2025. Improved visual-spatial reasoning via r1-zero-like training.arXiv preprint arXiv:2504.00883(2025)

  37. [38]

    Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, and Bo Zhao. 2025. Evo-0: Vision-language-action model with implicit spatial understanding.arXiv preprint arXiv:2507.00416(2025)

  38. [39]

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. 2023. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems36 (2023), 44776–44791

  39. [40]

    Disheng Liu, Tuo Liang, Zhe Hu, Jierui Peng, Yiren Lu, Yi Xu, Yun Fu, and Yu Yin

  40. [41]

    Deconstructing Spatial Intelligence in Vision-Language Models.Authorea Preprints(2025)

  41. [42]

    Disheng Liu, Tuo Liang, Zhe Hu, Jierui Peng, Yiren Lu, Yi Xu, Yun Fu, and Yu Yin. 2026. Spatial Intelligence in Vision-Language Models: A Comprehensive Survey. (2026)

  42. [43]

    Haoyun Liu, Jianzhuang Zhao, Xinyuan Chang, Tianle Shi, Chuanzhang Meng, Jiayuan Tan, Feng Xiong, Tong Lin, Dongjie Huo, Mu Xu, et al . 2026. Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models.arXiv preprint arXiv:2603.01766(2026)

  43. [44]

    Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al . 2025. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model.arXiv preprint arXiv:2503.10631(2025)

  44. [45]

    Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, and Yanwei Fu. 2026. ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation.arXiv preprint arXiv:2601.08325(2026)

  45. [46]

    Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. 2025. SpatialLM: Training Large Language Models for Structured Indoor Modeling. arXiv:2506.07491 [cs.CV] https://arxiv.org/abs/ 2506.07491

  46. [47]

    J Bjorck Nvidia, Fernando Castaneda, N Cherniadev, X Da, R Ding, L Fan, Y Fang, D Fox, F Hu, S Huang, et al. 2025. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 2 (2025)

  47. [48]

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. 2025. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692(2025)

  48. [49]

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. 2025. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830 (2025)

  49. [50]

    Zhifeng Rao, Wenlong Chen, Lei Xie, Xia Hua, Dongfu Yin, Zhen Tian, and F Richard Yu. 2026. AugVLA-3D: Depth-Driven Feature Augmentation for Vision- Language-Action Models.arXiv preprint arXiv:2602.10698(2026)

  50. [51]

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. 2021. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction

  51. [52]

    Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. 2025. VideoVLA: Video Generators Can Be Generalizable Robot Manipulators.Advances in neural information processing systems(2025)

  52. [53]

    Zijian Song, Qichang Li, Sihan Qin, Yuhao Chen, Tianshui Chen, Liang Lin, and Guangrun Wang. 2026. Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation.arXiv preprint arXiv:2603.00110(2026)

  53. [54]

    Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, and Guangrun Wang. 2025. Physical autoregressive model for robotic manipulation without action pretraining. arXiv preprint arXiv:2508.09822 (2025)

  54. [55]

    Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, and Jiale Cao. 2025. Geovla: Empowering 3d representations in vision-language-action models.arXiv preprint arXiv:2508.09071(2025)

  55. [56]

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. 2024. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213(2024)

  56. [57]

    An Dinh Vuong, Minh Nhat Vu, and Ian Reid. 2025. Improving Robotic Manipulation with Efficient Geometry-Aware Vision Encoder. arXiv preprint arXiv:2509.15880 (2025)

  57. [58]

    Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian, Yogesh S Rawat, Yunhao Ge, and Yuzhang Shang. 2026. VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning. arXiv preprint arXiv:2603.14523 (2026)

  58. [59]

    Guangcong Wang, Zhaoxi Chen, Chen Change Loy, and Ziwei Liu. 2023. SparseNeRF: Distilling depth ranking for few-shot novel view synthesis. In Proceedings of the IEEE/CVF international conference on computer vision. 9065–9076

  59. [60]

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference. 5294–5306

  60. [61]

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. 2025. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model.arXiv preprint arXiv:2509.09372(2025)

  61. [62]

    Zuoxu Wang, Zhijie Yan, Shufei Li, and Jihong Liu. 2025. IndVisSGG: VLM-based scene graph generation for industrial spatial intelligence.Advanced Engineering Informatics65 (2025), 103107

  62. [63]

    Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng

  63. [64]

    Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855(2025)

  64. [65]

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. 2025. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters (2025)

  65. [66]

    Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. 2024. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 12156–12163

  66. [67]

    Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, et al . 2025. A0: An affordance-aware hierarchical model for general robotic manipulation.arXiv preprint arXiv:2504.12636(2025)

  67. [68]

    Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, et al. 2026. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning.arXiv preprint arXiv:2602.11236(2026)

  68. [69]

    Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. 2020. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1790–1799

  69. [70]

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al

  70. [71]

    World action models are zero-shot policies.arXiv preprint arXiv:2602.15922 (2026)

  71. [72]

    Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, and Hang Zhao. 2025. Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning.arXiv preprint arXiv:2510.13375(2025)

  72. [73]

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 2024. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations.arXiv preprint arXiv:2403.03954(2024)

  73. [74]

    Zhihao Zhan, Yuhao Chen, Jiaying Zhou, Qinhan Lv, Hao Liu, Keze Wang, Liang Lin, and Guangrun Wang. 2026. Stable Language Guidance for Vision-Language-Action Models. arXiv preprint arXiv:2601.04052 (2026)

  74. [75]

    Zhihao Zhan, Jiaying Zhou, Likui Zhang, Qinhan Lv, Hao Liu, Jusheng Zhang, Weizheng Li, Ziliang Chen, Tianshui Chen, Ruifeng Zhai, Keze Wang, Liang Lin, and Guangrun Wang. 2026. E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Tweedie Discrete Diffusion. arXiv:2511.21542 [cs.RO] https://arxiv.org/abs/2511.21542

  75. [76]

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al . 2025. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447(2025)

  76. [77]

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al . 2025. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference. 1702–1713

  77. [78]

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. 2023. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705(2023)

  78. [79]

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. 2024. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345(2024)

  79. [80]

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404(2024)

  80. [81]

    Jiaying Zhou, Zhihao Zhan, Ruifeng Zhai, Qinhan Lyu, Hao Liu, Keze Wang, Liang Lin, and Guangrun Wang. 2026. TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models.arXiv preprint arXiv:2603.24584(2026)

Showing first 80 references.