arxiv: 2602.19710 · v2 · pith:3X2GFYZFnew · submitted 2026-02-23 · cs.CV · cs.LG · cs.RO

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

Pith reviewed 2026-05-21 12:59 UTC · model grok-4.3

classification: cs.CV · cs.LG · cs.RO
keywords: vision-language-action, pose tokens, 3D spatial priors, robotics, discrete representations, embodiment alignment, generalization, pretraining

The pith

Discrete pose tokens let VLA models pretrain universal 3D spatial priors before aligning to specific robot actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Pose-VLA to fix feature collapse in vision-language-action models, which currently mix high-level perception with sparse robot-specific actions and therefore miss fine 3D variations needed for correct behavior. It splits training into a first phase that learns universal spatial priors from many 3D datasets inside a shared camera-centric coordinate frame, then a second phase that maps those priors onto any given robot’s action space using trajectory data. Discrete pose tokens serve as the bridge that carries spatial information across both phases without forcing the model to re-learn geometry from scratch. A sympathetic reader would care because the split promises faster adaptation to new robots and objects while using far fewer real-world demonstrations than end-to-end training.

Core claim

Pose-VLA decouples VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space using discrete pose tokens, followed by a post-training phase for efficient embodiment alignment within robot-specific action space. By treating discrete pose tokens as a universal representation, the method integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. This two-stage pipeline first establishes fundamental spatial grounding via poses and then performs motion alignment through trajectory supervision, yielding 79.5 percent average success on RoboTwin 2.0 and 96.0 percent on LIBERO.
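
To make the "unified camera-centric space" concrete, the sketch below re-expresses an object pose given in a dataset's world frame in the frame of the observing camera. It is an illustration only: the 4x4 homogeneous-matrix convention and the helper names (`invert_rigid`, `to_camera_frame`) are assumptions and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def to_camera_frame(world_T_cam: Matrix, world_T_obj: Matrix) -> Matrix:
    """Express an object pose in the camera frame: cam_T_obj = inv(world_T_cam) @ world_T_obj."""
    return matmul(invert_rigid(world_T_cam), world_T_obj)


# Hypothetical example: camera 1 m along world x, object at (1, 0, 2) in the world.
world_T_cam = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
world_T_obj = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 2.0], [0.0, 0.0, 0.0, 1.0]]
cam_T_obj = to_camera_frame(world_T_cam, world_T_obj)
print([row[3] for row in cam_T_obj[:3]])  # object position as seen from the camera
```

Expressing every dataset's annotations relative to its own camera removes the dependence on each dataset's world-frame convention, which is presumably what lets heterogeneous 3D sources share one target space.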

What carries the argument

Discrete pose tokens as a universal representation that integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations.
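
One simple way to picture a discrete pose token is uniform binning of each pose dimension. The sketch below is a minimal illustration under that assumption. The page does not describe the paper's actual tokenizer, so the bin count, the value ranges, the 6-DoF layout (translation plus Euler angles), and the function names are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical settings: neither the bin count nor the ranges come from the paper.
NUM_BINS = 256
RANGES: List[Tuple[float, float]] = [
    (-1.0, 1.0), (-1.0, 1.0), (0.0, 2.0),                        # x, y, z in metres (camera frame)
    (-math.pi, math.pi), (-math.pi, math.pi), (-math.pi, math.pi),  # roll, pitch, yaw in radians
]


def encode_pose(pose: Sequence[float]) -> List[int]:
    """Map each continuous pose dimension to an integer token in [0, NUM_BINS)."""
    tokens = []
    for value, (lo, hi) in zip(pose, RANGES):
        clipped = min(max(value, lo), hi)          # out-of-range values saturate
        frac = (clipped - lo) / (hi - lo)
        tokens.append(min(int(frac * NUM_BINS), NUM_BINS - 1))
    return tokens


def decode_pose(tokens: Sequence[int]) -> List[float]:
    """Map tokens back to the centre of their bins."""
    return [lo + (t + 0.5) * (hi - lo) / NUM_BINS
            for t, (lo, hi) in zip(tokens, RANGES)]


pose = [0.1, -0.2, 0.8, 0.0, 0.5, -1.0]
tokens = encode_pose(pose)
print(tokens)
print(decode_pose(tokens))  # equals the input up to half a bin width per dimension
```

Under this scheme the same token vocabulary can label an object pose from a 3D dataset and an end-effector waypoint from a robot demonstration, which is the sense in which the tokens act as a shared representation.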

If this is right

  • The two-stage pipeline first builds spatial grounding from poses and then aligns motion through trajectory supervision.
  • The method reaches 79.5 percent average success rate on the RoboTwin 2.0 benchmark.
  • It achieves competitive 96.0 percent success on the LIBERO benchmark.
  • Real-world tests show robust generalization to diverse objects when only 100 demonstrations per task are available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pre-trained spatial model could be reused across multiple robot embodiments with only lightweight action-head fine-tuning.
  • Scaling the 3D pre-training corpus to include more cluttered or dynamic scenes might further improve zero-shot transfer to novel objects.
  • Because pose tokens are discrete and camera-centric, the approach may extend naturally to sim-to-real transfer by aligning simulated and real camera frames before action learning.

Load-bearing premise

Pre-training on universal 3D spatial priors in a unified camera-centric space using discrete pose tokens transfers effectively to embodiment-specific action spaces and resolves feature collapse without losing critical action-relevant variations.

What would settle it

Training an otherwise identical VLA model from scratch on the same robotic demonstrations without the pose-token pre-training stage and measuring whether it reaches within 5 percentage points of 79.5 percent success on RoboTwin 2.0 would test the necessity of the universal pre-training step.
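
The decision rule in this test is simple arithmetic: a from-scratch baseline at or above 79.5 - 5 = 74.5 percent would count against the necessity of pre-training. The sketch below encodes that rule. The function name and the example baseline scores are hypothetical and are not results from the paper.

```python
def within_margin(baseline_pct: float,
                  reference_pct: float = 79.5,
                  margin_pts: float = 5.0) -> bool:
    """Return True if the baseline reaches within `margin_pts` points of the reference."""
    return baseline_pct >= reference_pct - margin_pts


# Hypothetical baseline scores, for illustration only.
for baseline in (70.0, 76.0):
    if within_margin(baseline):
        verdict = "pre-training looks dispensable"
    else:
        verdict = "pre-training looks necessary"
    print(f"baseline {baseline:.1f}% -> {verdict}")
```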

Figures

Figures reproduced from arXiv: 2602.19710 by Haitao Lin, Hanyang Yu, He Zhang, Jingshun Huang, Ping Tan, Xiangyang Xue, Yanwei Fu, Yonggen Ling.

Figure 1: Overview of Pose-VLA. Unlike previous VLAs that rely solely on ...
Figure 2: Pipeline of Pose-VLA. Pose-VLA decouples VLA training into: (1) ...
Figure 3: Generalization of 3D spatial grounding across unseen scenarios. Pose-VLA exhibits robust generalization across various unseen settings, ranging from ...
Figure 4: Real-world setup of four representative tasks. Our platform uses a ...
Figure 5: Success rate comparison of Pose-VLA and baseline models across ...
Figure 6: Data statistics of object translation and size in datasets. Translations ...
Figure 7: t-SNE visualization of VL features across 20 tasks in RoboTwin ...

Original abstract

Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Pose-VLA, a decoupled paradigm for Vision-Language-Action models that separates training into a pre-training phase extracting universal 3D spatial priors in a unified camera-centric space via discrete pose tokens from diverse 3D datasets, followed by a post-training phase for embodiment alignment using robotic trajectories. By treating discrete pose tokens as a universal representation, the approach claims to integrate spatial grounding with geometry-level actions, resolve feature collapse in VLM-based VLAs, and deliver state-of-the-art results including 79.5% average success on RoboTwin 2.0, 96.0% on LIBERO, and robust real-world generalization with only 100 demonstrations per task.

Significance. If the empirical claims hold after detailed validation, the work would represent a meaningful step toward more generalizable VLA policies by decoupling high-level perception from sparse action supervision through pose-based pretraining on large-scale 3D data. The two-stage pipeline offers a plausible route to improved spatial grounding and training efficiency in robotic tasks.

major comments (2)
  1. The abstract states strong benchmark numbers (79.5% on RoboTwin 2.0) and claims resolution of feature collapse but supplies no experimental details, baseline comparisons, ablation studies, or error analysis; the central performance claims cannot be evaluated from the given text.
  2. The assumption that discrete pose tokens serve as a lossless universal bridge integrating 3D priors with robotic trajectories is load-bearing for the transfer claims, yet no bound on quantization error or ablation isolating token resolution is provided, leaving the security of action-discriminative signals unclear.
minor comments (1)
  1. The description of the two-stage pre-training pipeline would benefit from a schematic diagram to illustrate the flow from pose token pre-training to motion alignment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have addressed each major comment below with revisions to improve experimental transparency and analysis of the discrete pose token representation.

  1. Referee: The abstract states strong benchmark numbers (79.5% on RoboTwin 2.0) and claims resolution of feature collapse but supplies no experimental details, baseline comparisons, ablation studies, or error analysis; the central performance claims cannot be evaluated from the given text.

    Authors: We agree that the abstract is concise by nature and does not contain the full experimental details. The manuscript body (Sections 4.1–4.3 and 5) already includes baseline comparisons against RT-2, OpenVLA, and other VLA methods, ablation studies on the two-stage pre-training, and error analysis of failure modes on RoboTwin 2.0. To make the central claims more immediately evaluable, we have revised the abstract to briefly note the evaluation protocol and key baselines, and we have added a compact results summary table in the introduction that cross-references the detailed tables and figures in the experimental section. revision: yes

  2. Referee: The assumption that discrete pose tokens serve as a lossless universal bridge integrating 3D priors with robotic trajectories is load-bearing for the transfer claims, yet no bound on quantization error or ablation isolating token resolution is provided, leaving the security of action-discriminative signals unclear.

    Authors: We acknowledge the importance of quantifying potential information loss from discretization. The original submission contained ablations on pre-training objectives but did not isolate vocabulary size. In the revision we have added a new ablation (Table 7) that varies the number of discrete pose tokens (128, 256, 512, 1024) and reports success rates on RoboTwin 2.0, showing that performance plateaus beyond 512 tokens while still preserving action discriminability. We have also added an appendix analysis that measures average L2 reconstruction error between original 3D poses and poses decoded from the discrete tokens on held-out 3D datasets. A strict theoretical bound on quantization error is difficult to derive without strong distributional assumptions; we therefore rely on the empirical evidence and have clarified in Section 3.2 how the subsequent trajectory-alignment stage compensates for any residual loss. revision: partial
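
The kind of reconstruction-error measurement described in this response can be sketched as a sweep over bin counts. This is an illustration only: it assumes per-dimension uniform binning of 3D translations in a [-1, 1] range with synthetic random samples, and it does not reproduce the paper's tokenizer, Table 7, or the appendix analysis. The helper names are hypothetical.

```python
import math
import random


def quantize(value: float, lo: float, hi: float, num_bins: int) -> float:
    """Snap a value to the centre of its uniform bin."""
    frac = (min(max(value, lo), hi) - lo) / (hi - lo)
    token = min(int(frac * num_bins), num_bins - 1)
    return lo + (token + 0.5) * (hi - lo) / num_bins


def mean_l2_error(num_bins: int, num_samples: int = 2000, seed: int = 0) -> float:
    """Average L2 distance between synthetic 3D points and their quantized versions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        point = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        recon = [quantize(v, -1.0, 1.0, num_bins) for v in point]
        total += math.dist(point, recon)
    return total / num_samples


for bins in (128, 256, 512, 1024):
    half_bin = (2.0 / bins) / 2.0
    worst_case = half_bin * math.sqrt(3)  # all three axes off by half a bin
    print(f"{bins:5d} bins: mean L2 error {mean_l2_error(bins):.5f}, worst case {worst_case:.5f}")
```

For uniform binning the worst-case error does follow directly from the bin width, so the difficulty the authors mention presumably concerns how such errors affect downstream action discriminability rather than the geometric error itself.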

Circularity Check

0 steps flagged

No significant circularity; empirical results with no derivation chain

The paper presents Pose-VLA as a two-stage pre-training and post-training paradigm that uses discrete pose tokens to bridge 3D spatial priors and robotic trajectories, with reported performance on RoboTwin 2.0 and LIBERO framed explicitly as empirical outcomes of training rather than quantities derived from fitted parameters or self-referential definitions. No equations, mathematical derivations, or load-bearing self-citations that reduce the central claims to their own inputs appear in the manuscript. The method is self-contained against external benchmarks through reported success rates and real-world experiments, satisfying the criteria for an honest non-finding of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the effectiveness of the introduced pose-token representation and the transferability of camera-centric spatial priors, both of which are postulated without independent verification in the abstract.

axioms (2)
  • [domain assumption] VLM backbones optimized for VQA overlook subtle 3D state variations that dictate distinct action patterns.
    Invoked in the abstract as the root cause of misalignments in existing VLA models.
  • [ad hoc to paper] Discrete pose tokens can serve as a universal representation that seamlessly integrates spatial grounding from 3D datasets with robotic trajectories.
    Introduced in the abstract as the key mechanism enabling the decoupled paradigm.
invented entities (1)
  • discrete pose tokens [no independent evidence]
    purpose: Universal representation for 3D spatial priors in camera-centric space
    New representational unit proposed to bridge perception and action supervision.

pith-pipeline@v0.9.0 · 5781 in / 1578 out tokens · 52910 ms · 2026-05-21T12:59:51.228033+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 20 internal anchors

  1. [1]

    Objectron: A large scale dataset of object-centric videos in the wild with pose annotations

    Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7822–7831, 2021

  2. [2]

    On the representation degradation in vision-language-action models

    Anonymous. On the representation degradation in vision-language-action models. In Submitted to International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=qR2TjMZ10B

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  6. [6]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A vi...

  7. [7]

    π0.5: a vision-language-action model with open-world generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al. π0.5: a vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025

  8. [8]

    Omni3d: A large benchmark and model for 3d object detection in the wild

    Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3d: A large benchmark and model for 3d object detection in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13154–13164, 2023

  9. [9]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

  10. [10]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

  11. [11]

    Training strategies for efficient embodied reasoning

    William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning. arXiv preprint arXiv:2505.08243, 2025

  12. [12]

    Language-image models with 3d understanding

    Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, et al. Language-image models with 3d understanding. arXiv preprint arXiv:2405.03685, 2024

  13. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  14. [14]

    Vla-0: Building state-of-the-art vlas with zero modification

    Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos. Vla-0: Building state-of-the-art vlas with zero modification. arXiv preprint arXiv:2510.13054, 2025

  15. [15]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025

  16. [16]

    Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors

    Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1071–1081, 2025

  17. [17]

    Don’t blind your vla: Aligning visual representations for ood generalization

    Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K Kovalev, and Aleksandr I Panov. Don’t blind your vla: Aligning visual representations for ood generalization. arXiv preprint arXiv:2510.25616, 2025

  18. [18]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  19. [19]

    MolmoAct: Action Reasoning Models that can Reason in Space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025

  20. [20]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276, 2025

  21. [21]

    Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies

    Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, and Ping Luo. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072, 2025

  22. [22]

    Onetwovla: A unified vision-language-action model with adaptive reasoning

    Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025

  23. [23]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  24. [24]

    Libero: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36, 2024

  25. [25]

    HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

  26. [26]

    Rectified Flow: A Marginal Preserving Approach to Optimal Transport

    Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022

  27. [27]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  28. [28]

    Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning

    Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning. arXiv preprint arXiv:2504.20024, 2025

  29. [29]

    Locateanything3d: Vision-language 3d detection with chain-of-sight

    Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, and Zhiding Yu. Locateanything3d: Vision-language 3d detection with chain-of-sight. arXiv preprint arXiv:2511.20648, 2025

  30. [30]

    Spatiallm: Training large language models for structured indoor modeling

    Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. Spatiallm: Training large language models for structured indoor modeling. arXiv preprint arXiv:2506.07491, 2025

  31. [31]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  32. [32]

    Eo-1: Interleaved vision-text-action pretraining for general robot control

    Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision-text-action pretraining for general robot control. arXiv preprint arXiv:2508.21112, 2025

  33. [33]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025

  34. [34]

    Qwen3-vl: A frontier multimodal large language model

    Qwen Team. Qwen3-vl: A frontier multimodal large language model. https://github.com/QwenLM/Qwen3-VL,

  35. [35]

    Accessed: 2026-01-22

  36. [36]

    MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025

  37. [37]

    Sun rgb-d: A rgb-d scene understanding benchmark suite

    Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015

  38. [38]

    Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning

    Qi Sun, Pengfei Hong, Tej Deep Pala, Vernon Toh, U-Xuan Tan, Deepanway Ghosal, and Soujanya Poria. Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14199–14214, 2025

  39. [39]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

  40. [40]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  41. [41]

    Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers

    Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers. arXiv preprint arXiv:2507.01016, 2025

  42. [42]

    N3d-vlm: Native 3d grounding enables accurate spatial reasoning in vision-language models

    Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, and Dong Yu. N3d-vlm: Native 3d grounding enables accurate spatial reasoning in vision-language models. arXiv preprint arXiv:2512.16561, 2025

  43. [43]

    Vlm-grounder: A vlm agent for zero-shot 3d visual grounding

    Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Vlm-grounder: A vlm agent for zero-shot 3d visual grounding. arXiv preprint arXiv:2410.13860, 2024

  44. [44]

    Visual spatial tuning

    Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, et al. Visual spatial tuning. arXiv preprint arXiv:2511.05491, 2025

  45. [45]

    Instructvla: Vision-language-action instruction tuning from understanding to manipulation

    Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, and Jiangmiao Pang. Instructvla: Vision-language-action instruction tuning from understanding to manipulation. arXiv preprint arXiv:2507.17520, 2025

  46. [46]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024

  47. [47]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  48. [48]

    Vlm4vla: Revisiting vision-language-models in vision-language-action models

    Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026

  49. [49]

    Omni6dpose: A benchmark and model for universal 6d object pose estimation and tracking

    Jiyao Zhang, Weiyao Huang, Bo Peng, Mingdong Wu, Fei Hu, Zijian Chen, Bo Zhao, and Hao Dong. Omni6dpose: A benchmark and model for universal 6d object pose estimation and tracking. In European Conference on Computer Vision, pages 199–216. Springer, 2024

  50. [50]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

  51. [51]

    Chatvla: Unified multimodal understanding and robot control with vision-language-action model

    Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025

  52. [52]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.