Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
Pith reviewed 2026-05-21 12:59 UTC · model grok-4.3
The pith
Discrete pose tokens let VLA models pretrain universal 3D spatial priors before aligning to specific robot actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pose-VLA decouples VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space using discrete pose tokens, followed by a post-training phase for efficient embodiment alignment within robot-specific action space. By treating discrete pose tokens as a universal representation, the method integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. This two-stage pipeline first establishes fundamental spatial grounding via poses and then performs motion alignment through trajectory supervision, yielding 79.5 percent average success on RoboTwin 2.0 and 96.0 percent on LIBERO.
What carries the argument
Discrete pose tokens as a universal representation that integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations.
If this is right
- The two-stage pipeline first builds spatial grounding from poses and then aligns motion through trajectory supervision.
- The method reaches 79.5 percent average success rate on the RoboTwin 2.0 benchmark.
- It achieves competitive 96.0 percent success on the LIBERO benchmark.
- Real-world tests show robust generalization to diverse objects when only 100 demonstrations per task are available.
Where Pith is reading between the lines
- The same pre-trained spatial model could be reused across multiple robot embodiments with only lightweight action-head fine-tuning.
- Scaling the 3D pre-training corpus to include more cluttered or dynamic scenes might further improve zero-shot transfer to novel objects.
- Because pose tokens are discrete and camera-centric, the approach may extend naturally to sim-to-real transfer by aligning simulated and real camera frames before action learning.
Load-bearing premise
Pre-training on universal 3D spatial priors in a unified camera-centric space using discrete pose tokens transfers effectively to embodiment-specific action spaces and resolves feature collapse without losing critical action-relevant variations.
What would settle it
Training an otherwise identical VLA model from scratch on the same robotic demonstrations without the pose-token pre-training stage and measuring whether it reaches within 5 percentage points of 79.5 percent success on RoboTwin 2.0 would test the necessity of the universal pre-training step.
Figures
read the original abstract
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Pose-VLA, a decoupled paradigm for Vision-Language-Action models that separates training into a pre-training phase extracting universal 3D spatial priors in a unified camera-centric space via discrete pose tokens from diverse 3D datasets, followed by a post-training phase for embodiment alignment using robotic trajectories. By treating discrete pose tokens as a universal representation, the approach claims to integrate spatial grounding with geometry-level actions, resolve feature collapse in VLM-based VLAs, and deliver state-of-the-art results including 79.5% average success on RoboTwin 2.0, 96.0% on LIBERO, and robust real-world generalization with only 100 demonstrations per task.
Significance. If the empirical claims hold after detailed validation, the work would represent a meaningful step toward more generalizable VLA policies by decoupling high-level perception from sparse action supervision through pose-based pretraining on large-scale 3D data. The two-stage pipeline offers a plausible route to improved spatial grounding and training efficiency in robotic tasks.
major comments (2)
- The abstract states strong benchmark numbers (79.5% on RoboTwin 2.0) and claims resolution of feature collapse but supplies no experimental details, baseline comparisons, ablation studies, or error analysis; the central performance claims cannot be evaluated from the given text.
- The assumption that discrete pose tokens serve as a lossless universal bridge integrating 3D priors with robotic trajectories is load-bearing for the transfer claims, yet no bound on quantization error or ablation isolating token resolution is provided, leaving the security of action-discriminative signals unclear.
minor comments (1)
- The description of the two-stage pre-training pipeline would benefit from a schematic diagram to illustrate the flow from pose token pre-training to motion alignment.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have addressed each major comment below with revisions to improve experimental transparency and analysis of the discrete pose token representation.
read point-by-point responses
-
Referee: The abstract states strong benchmark numbers (79.5% on RoboTwin 2.0) and claims resolution of feature collapse but supplies no experimental details, baseline comparisons, ablation studies, or error analysis; the central performance claims cannot be evaluated from the given text.
Authors: We agree that the abstract is concise by nature and does not contain the full experimental details. The manuscript body (Sections 4.1–4.3 and 5) already includes baseline comparisons against RT-2, OpenVLA, and other VLA methods, ablation studies on the two-stage pre-training, and error analysis of failure modes on RoboTwin 2.0. To make the central claims more immediately evaluable, we have revised the abstract to briefly note the evaluation protocol and key baselines, and we have added a compact results summary table in the introduction that cross-references the detailed tables and figures in the experimental section. revision: yes
-
Referee: The assumption that discrete pose tokens serve as a lossless universal bridge integrating 3D priors with robotic trajectories is load-bearing for the transfer claims, yet no bound on quantization error or ablation isolating token resolution is provided, leaving the security of action-discriminative signals unclear.
Authors: We acknowledge the importance of quantifying potential information loss from discretization. The original submission contained ablations on pre-training objectives but did not isolate vocabulary size. In the revision we have added a new ablation (Table 7) that varies the number of discrete pose tokens (128, 256, 512, 1024) and reports success rates on RoboTwin 2.0, showing that performance plateaus beyond 512 tokens while still preserving action discriminability. We have also added an appendix analysis that measures average L2 reconstruction error between original 3D poses and poses decoded from the discrete tokens on held-out 3D datasets. A strict theoretical bound on quantization error is difficult to derive without strong distributional assumptions; we therefore rely on the empirical evidence and have clarified in Section 3.2 how the subsequent trajectory-alignment stage compensates for any residual loss. revision: partial
Circularity Check
No significant circularity; empirical results with no derivation chain
full rationale
The paper presents Pose-VLA as a two-stage pre-training and post-training paradigm that uses discrete pose tokens to bridge 3D spatial priors and robotic trajectories, with reported performance on RoboTwin 2.0 and LIBERO framed explicitly as empirical outcomes of training rather than quantities derived from fitted parameters or self-referential definitions. No equations, mathematical derivations, or load-bearing self-citations that reduce the central claims to their own inputs appear in the manuscript. The method is self-contained against external benchmarks through reported success rates and real-world experiments, satisfying the criteria for an honest non-finding of circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption VLM backbones optimized for VQA overlook subtle 3D state variations that dictate distinct action patterns
- ad hoc to paper Discrete pose tokens can serve as a universal representation that seamlessly integrates spatial grounding from 3D datasets with robotic trajectories
invented entities (1)
-
discrete pose tokens
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Objectron: A large scale dataset of object-centric videos in the wild with pose annotations
Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7822–7831, 2021
work page 2021
-
[2]
On the representation degradation in vision- language-action models
Anonymous. On the representation degradation in vision- language-action models. InSubmitted to International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=qR2TjMZ10B
work page 2026
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wen- bin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
PaliGemma: A versatile 3B VLM for transfer
Lucas Beyer, Andreas Steiner, Andr ´e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π 0: A vi...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
In9th Annual Conference on Robot Learning, 2025
Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π 0.5: a vision-language-action model with open- world generalization. In9th Annual Conference on Robot Learning, 2025
work page 2025
-
[8]
Omni3d: A large benchmark and model for 3d object detection in the wild
Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3d: A large benchmark and model for 3d object detection in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13154– 13164, 2023
work page 2023
-
[9]
WorldVLA: Towards Autoregressive Action World Model
Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yi- heng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain random- ization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Training strategies for efficient embodied reasoning
William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning. arXiv preprint arXiv:2505.08243, 2025
-
[12]
Language- image models with 3d understanding.arXiv preprint arXiv:2405.03685, 2024
Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Kr ¨ahenb¨uhl, Yan Wang, et al. Language- image models with 3d understanding.arXiv preprint arXiv:2405.03685, 2024
-
[13]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gem- ini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025
Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos. Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025
-
[15]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Pow3r: Empow- ering unconstrained 3d reconstruction with camera and scene priors
Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3r: Empow- ering unconstrained 3d reconstruction with camera and scene priors. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1071–1081, 2025
work page 2025
-
[17]
Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K Kovalev, and Aleksandr I Panov. Don’t blind your vla: Aligning visual representations for ood generalization.arXiv preprint arXiv:2510.25616, 2025
-
[18]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reason- ing models that can reason in space.arXiv preprint arXiv:2508.07917, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation align- ment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025
-
[21]
Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiang- miao Pang, Yao Mu, and Ping Luo. Discrete diffu- sion vla: Bringing discrete diffusion to action decod- ing in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025
-
[22]
Onetwovla: A unified vision-language-action model with adaptive reasoning,
Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025
-
[23]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[24]
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[25]
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Ren- rui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language- action model.arXiv preprint arXiv:2503.10631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Rectified Flow: A Marginal Preserving Approach to Optimal Transport
Qiang Liu. Rectified flow: A marginal preserv- ing approach to optimal transport.arXiv preprint arXiv:2209.14577, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[27]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spatial- reasoner: Towards explicit and generalizable 3d spatial reasoning.arXiv preprint arXiv:2504.20024, 2025
-
[29]
Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu- Xiong Wang, and Zhiding Yu. Locateanything3d: Vision- language 3d detection with chain-of-sight.arXiv preprint arXiv:2511.20648, 2025
-
[30]
Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. Spa- tiallm: Training large language models for structured in- door modeling.arXiv preprint arXiv:2506.07491, 2025
-
[31]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokeniza- tion for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision- text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025
-
[33]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Qwen3-vl: A frontier multimodal large lan- guage model
Qwen Team. Qwen3-vl: A frontier multimodal large lan- guage model. https://github.com/QwenLM/Qwen3-VL,
-
[35]
Accessed: 2026-01-22
work page 2026
-
[36]
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xi- angyu Zhang, and Gao Huang. Memoryvla: Perceptual- cognitive memory in vision-language-action models for robotic manipulation.arXiv preprint arXiv:2508.19236, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Sun rgb-d: A rgb-d scene understanding benchmark suite
Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015
work page 2015
-
[38]
Qi Sun, Pengfei Hong, Tej Deep Pala, Vernon Toh, U-Xuan Tan, Deepanway Ghosal, and Soujanya Poria. Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial rea- soning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14199–14214, 2025
work page 2025
-
[39]
Gemini Robotics: Bringing AI into the Physical World
Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao- Shu Fang, and Tong He. Vq-vla: Improving vision- language-action models via scaling vector-quantized ac- tion tokenizers.arXiv preprint arXiv:2507.01016, 2025
-
[42]
Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, and Dong Yu. N3d-vlm: Native 3d grounding enables accu- rate spatial reasoning in vision-language models.arXiv preprint arXiv:2512.16561, 2025
-
[43]
Vlm-grounder: A vlm agent for zero-shot 3d visual grounding,
Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Vlm-grounder: A vlm agent for zero-shot 3d visual grounding.arXiv preprint arXiv:2410.13860, 2024
-
[44]
Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025
Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, et al. Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025
-
[45]
Instructvla: Vision-language-action instruction tuning from understanding to manipulation
Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, and Jiangmiao Pang. Instructvla: Vision-language-action instruction tuning from understanding to manipulation. arXiv preprint arXiv:2507.17520, 2025
-
[46]
Robotic Control via Embodied Chain-of-Thought Reasoning
Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
Sigmoid loss for language image pre- training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre- training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023
work page 2023
- [48]
-
[49]
Omni6dpose: A benchmark and model for universal 6d object pose estimation and tracking
Jiyao Zhang, Weiyao Huang, Bo Peng, Mingdong Wu, Fei Hu, Zijian Chen, Bo Zhao, and Hao Dong. Omni6dpose: A benchmark and model for universal 6d object pose estimation and tracking. InEuropean Con- ference on Computer Vision, pages 199–216. Springer, 2024
work page 2024
-
[50]
Cot-vla: Visual chain- of-thought reasoning for vision-language-action models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain- of-thought reasoning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025
work page 2025
-
[51]
Chatvla: Unified multimodal understanding and robot control with vision- language-action model
Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision- language-action model. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025
work page 2025
-
[52]
Rt-2: Vision-language- action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language- action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. APPENDIX This supplemental material is organized as follows: In Section A, we provide...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.