Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

Chenchen Yang; Chenxin Li; Jingjing Gong; Lei Xiao; Linqi Yin; Pengfang Qian; Shenling Qiu; Shiduo Zhang; Xiang Wang; Xipeng Qiu

arxiv: 2606.08520 · v1 · pith:A3IJGP36new · submitted 2026-06-07 · 💻 cs.RO

Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

Linqi Yin , Shiduo Zhang , Shenling Qiu , Chenxin Li , Zhaoyang Fu , Lei Xiao , Xiang Wang , Chenchen Yang

show 6 more authors

Zhe Xu Pengfang Qian Jingjing Gong Xipeng Qiu Xuanjing Huang Yu-Gang Jiang

This is my paper

Pith reviewed 2026-06-27 18:29 UTC · model grok-4.3

classification 💻 cs.RO

keywords vision-language modelsvision-language-actionrobot policiesembodied datageneralizationfine-tuningtrajectory supervision

0 comments

The pith

Embodied trajectory-coupled data bridges VLMs to generalizable VLAs through gradual three-stage adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that directly fine-tuning vision-language models on robot action data forces the model across both visual perception and action prediction gaps at once, causing loss of pre-trained generalizations. It introduces embodied trajectory-coupled data drawn from the same robot scenes and trajectories but paired with language objectives as an intermediate form that shares visual context while keeping familiar supervision. A three-stage recipe first adapts the model to embodied visual-language semantics, then gradually shifts it toward action prediction while retaining representations, and finally specializes it to the target domain. Mixing task-relevant out-of-distribution ETC data with only a small amount of action data enables the resulting policy to handle novel visual-language conditions without collecting further robot demonstrations.

Core claim

Vision-language models can be turned into generalizable vision-language-action policies by using embodied trajectory-coupled data as a stepping stone that shares robot visual context while retaining language-understanding objectives; this enables a three-stage process of distribution bridging, objective bridging, and retentive adaptation, and mixing task-relevant out-of-distribution ETC data with limited action data transfers VLM generalizations into robust policies that succeed on novel conditions without additional demonstrations.

What carries the argument

Embodied trajectory-coupled (ETC) data: vision-language supervision derived from the same robot scenes and trajectories used for action learning.

If this is right

The model generalizes to novel visual-language conditions using only small amounts of action data.
Gradual bridging across distribution and objective gaps prevents degradation of VLM representations.
Three distinct stages are needed: first adapting to embodied visuals, then shifting objectives, then specializing to deployment.
Task-relevant out-of-distribution ETC data is effective for enabling generalization without new robot demonstrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gradual bridging approach could lower the total volume of robot demonstrations required to deploy capable policies across varied tasks.
Similar intermediate data forms might help adapt other pre-trained models when their output modality changes.
Experiments that vary how closely the ETC data matches the deployment visuals would show how domain-specific the data must be.

Load-bearing premise

That ETC data shares visual context while retaining familiar language objectives and thus acts as a natural stepping stone that preserves rather than degrades VLM representations during the transition to action prediction.

What would settle it

Training a model with the three-stage process but without mixing task-relevant out-of-distribution ETC data and checking whether it still generalizes to novel visual-language conditions at the same rate as the mixed version.

read the original abstract

Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies (VLAs) is surprisingly difficult. The root cause is a two-fold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once -- the learning curve is steep and the rich generalizations learned during pretraining tend to degrade rather than transfer. We argue that this gap can be bridged gradually with the right intermediate data. We introduce \emph{embodied trajectory-coupled (ETC) data} -- vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning. Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied visual-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel visual-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is the key to transferring VLM generalization into robust, deployable robot policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main contribution is a three-stage recipe using embodied trajectory-coupled data to bridge VLMs to VLAs without the usual representation collapse.

read the letter

The core idea here is straightforward: direct fine-tuning of VLMs on robot actions is too big a jump, so they insert ETC data as an intermediate step. The three stages—Distribution Bridging to get the model used to robot scenes with language objectives, Objective Bridging to ease into action prediction, and Retentive Adaptation to lock in the target domain—plus the trick of mixing task-relevant out-of-distribution ETC data with a little action data, form a coherent training path.

What the work does well is lay out a clear empirical strategy that matches the stated problem. The motivation about preserving VLM generalizations is reasonable, and the abstract's description of simulation and real-robot results suggests the stages deliver measurable gains in generalization without extra demonstrations. The mixing experiment is a nice practical addition.

The soft spots are mostly around implementation details that the abstract leaves open. How the ETC data is actually constructed and labeled matters a lot for reproducibility, and the paper would be stronger with explicit ablations showing what happens when any one stage is skipped. The claim that ETC data acts as a natural stepping stone is plausible but rests on the assumption that shared visual context outweighs any distribution shift; that needs solid evidence in the full results.

This is aimed at people building VLAs who already work with large vision-language models. It is not a theoretical advance but a usable training recipe. The internal logic holds up and the experiments are cited as confirmation, so it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that direct fine-tuning of VLMs on robot action data is hindered by simultaneous gaps in visual distribution and training objective, leading to degraded generalization. It introduces embodied trajectory-coupled (ETC) data—vision-language supervision from the same robot scenes and trajectories—as an intermediate bridge. A three-stage recipe is proposed: Distribution Bridging adapts the VLM to embodied visual-language semantics; Objective Bridging gradually shifts toward action prediction while preserving representations; Retentive Adaptation specializes to the target domain. Experiments show that mixing task-relevant out-of-distribution ETC data with limited action data enables generalization to novel visual-language conditions without extra robot demonstrations, confirmed in simulation and real-robot settings.

Significance. If the empirical results hold, the work provides a concrete, data-efficient recipe for transferring VLM generalization to deployable VLAs. The emphasis on gradual bridging via ETC data and the mixing strategy for OOD generalization addresses a practical bottleneck in embodied AI, potentially lowering the data requirements for robust robot policies. The staged approach is a strength if the preservation of pretrained representations is demonstrated.

major comments (2)

[§3] §3 (three-stage recipe): the claim that Objective Bridging 'gradually shifts' the model while preserving representations requires explicit ablation showing that skipping this stage degrades performance relative to the full recipe; without such controls the necessity of the intermediate objective remains unproven.
[Experiments] Experiments section (mixing results): the reported generalization to novel conditions relies on 'task-relevant out-of-distribution ETC data'; the definition and selection criteria for what counts as 'task-relevant' must be stated precisely, as overly broad selection could inflate the apparent benefit of mixing.

minor comments (2)

[Abstract, §2] Abstract and §2: the term 'embodied trajectory-coupled (ETC) data' is introduced without a formal definition or example of the exact supervision format (e.g., caption style, trajectory encoding); a short illustrative example would improve clarity.
[Figures] Figure captions (throughout): several figures comparing VLA variants lack error bars or statistical significance markers despite the text claiming 'robust' improvements; adding these would strengthen the presentation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, recognition of the practical value of our approach, and the recommendation for minor revision. We address each major comment below and will update the manuscript accordingly.

read point-by-point responses

Referee: [§3] §3 (three-stage recipe): the claim that Objective Bridging 'gradually shifts' the model while preserving representations requires explicit ablation showing that skipping this stage degrades performance relative to the full recipe; without such controls the necessity of the intermediate objective remains unproven.

Authors: We agree that an explicit ablation is required to substantiate the necessity of the Objective Bridging stage. The revised manuscript will include a new ablation comparing the full three-stage recipe against a two-stage variant that omits Objective Bridging, reporting the resulting degradation in both in-distribution performance and out-of-distribution generalization to confirm that gradual objective shifting is essential for representation preservation. revision: yes
Referee: [Experiments] Experiments section (mixing results): the reported generalization to novel conditions relies on 'task-relevant out-of-distribution ETC data'; the definition and selection criteria for what counts as 'task-relevant' must be stated precisely, as overly broad selection could inflate the apparent benefit of mixing.

Authors: We acknowledge the need for a precise definition. In the revised Experiments section we will add an explicit subsection defining task-relevance via two quantitative criteria: (1) semantic overlap measured by cosine similarity of language embeddings between ETC trajectories and target tasks, and (2) visual distribution overlap computed via feature-space distance to the in-distribution robot scenes. This will make the selection process transparent and rule out overly broad inclusion. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with no derivations

full rationale

The paper describes an empirical three-stage training recipe (Distribution Bridging, Objective Bridging, Retentive Adaptation) using embodied trajectory-coupled (ETC) data to transition from VLMs to VLAs. No equations, first-principles derivations, fitted parameters, or mathematical reductions are present or claimed. The central claim rests on experimental results from simulation and real-robot tests showing generalization via mixing ETC and action data, without any self-definitional loops, fitted inputs called predictions, or load-bearing self-citations. The approach is self-contained as a practical training strategy validated externally to any internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that language-understanding objectives remain compatible with embodied visual contexts without further justification.

pith-pipeline@v0.9.1-grok · 5859 in / 1055 out tokens · 11984 ms · 2026-06-27T18:29:45.319278+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 15 linked inside Pith

[1]

Gemini robotics: Bring- ing ai into the physical world

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzĳl, et al. Gemini robotics: Bring- ing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

Pith/arXiv arXiv 2025
[2]

Embodiedmidtrain: Bridging the gap between vision-language models and vision-language-action models via mid-training

Yiyang Du, Zhanqiu Guo, Xin Ye, Liu Ren, and Chenyan Xiong. Embodiedmidtrain: Bridging the gap between vision-language models and vision-language-action models via mid-training. arXiv preprint arXiv:2604.20012, 2026

Pith/arXiv arXiv 2026
[3]

Igniting vlms toward the embodied space

Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766, 2025

arXiv 2025
[4]

Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution

Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, et al. Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution. arXiv preprint arXiv:2602.12684, 2026

arXiv 2026
[5]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

2023
[6]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Haus- man, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pe...

Pith/arXiv arXiv 2025
[7]

A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation

Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation. arXiv preprint arXiv:2602.01067, 2026

arXiv 2026
[8]

Chatvla: Unified multimodal understanding and robot control with vision-language-action model

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages 5377–5395, 2025

2025
[9]

Palm-e: An embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery , Brian Ichter, Ayzaan Wahid, Jonathan T ompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

Pith/arXiv arXiv 2023
[10]

Robobrain: A unified brain model for robotic manipulation from abstract to concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 1724–1734, 2025

2025
[11]

Eo-1: An open unified embodied foundation model for general robot control, 2026

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Dong Wang, Xinyi Ye, Qi Lv , Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, and Xuelong Li. Eo-1: An open unified embodied foundation model for general robot control, 2026. URL https://arxiv.org/abs/2508.21112

arXiv 2026
[12]

Gr-3 technical report

Chilam Cheang, Sĳin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report. arXiv preprint arXiv:2507.15493, 2025

Pith/arXiv arXiv 2025
[13]

Galaxea open-world dataset and g0 dual-system vla model

Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576, 2025

arXiv 2025
[14]

Knowledge insulating vision-language-action models: Train fast, run fast, gener- alize better

Danny Driess, Jost Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, gener- alize better. Advances in Neural Information Processing Systems, 38:102867–102888, 2026

2026
[15]

A pragmatic vla foundation model

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026. 10

Pith/arXiv arXiv 2026
[16]

𝜋0: A vision-language-action flow model for general robot control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 𝜋0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[17]

Gr00t n1: An open foundation model for generalist humanoid robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev , Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025
[18]

Faster: T oward efficient autoregressive vision language action modeling via neural action tokenization

Yicheng Liu, Shiduo Zhang, Zibin Dong, Baĳun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, et al. Faster: T oward efficient autoregressive vision language action modeling via neural action tokenization. arXiv preprint arXiv:2512.04952, 2025

arXiv 2025
[19]

Vlm4vla: Revisiting vision-language-models in vision-language-action models

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026

Pith/arXiv arXiv 2026
[20]

Actions as language: Fine- tuning vlms into vlas without catastrophic forgetting

Asher J Hancock, Xindi Wu, Lihan Zha, Olga Russakovsky , and Anirudha Majumdar. Actions as language: Fine- tuning vlms into vlas without catastrophic forgetting. arXiv preprint arXiv:2509.22195, 2025

arXiv 2025
[21]

Internvla-a1: Unifying understanding, generation and action for robotic manipulation

Junhao Cai, Zetao Cai, Jiafei Cao, Yilun Chen, Zeyu He, Lei Jiang, Hang Li, Hengjie Li, Yang Li, Yufei Liu, et al. Internvla-a1: Unifying understanding, generation and action for robotic manipulation. arXiv preprint arXiv:2601.02456, 2026

arXiv 2026
[22]

Sanketi, and Ken Goldberg

Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Pannag R. Sanketi, and Ken Goldberg. Robo2VLM: Visual question answering from large-scale in-the-wild robot manipulation datasets. arXiv preprint arXiv:2505.15517, 2025

arXiv 2025
[23]

Robovqa: Multimodal long-horizon reasoning for robotics

Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 645–652. IEEE, 2024

2024
[24]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 2: Short Papers), pages 346–355, 2024

2024
[25]

Roborefer: T owards spatial referring with reasoning in vision-language models for robotics

Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: T owards spatial referring with reasoning in vision-language models for robotics. Advances in Neural Information Processing Systems, 38:28404–28481, 2026

2026
[26]

Robopoint: A vision-language model for spatial affordance prediction for robotics

Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay , Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousa- vian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024

arXiv 2024
[27]

Molmoact: Action reasoning models that can reason in space

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917 , 2025

Pith/arXiv arXiv 2025
[28]

Robobrain 2.0 technical report

BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029, 2025

arXiv 2025
[29]

Mimo-embodied: X-embodied foundation model technical report

Xiaoshuai Hao, Lei Zhou, Zhĳian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, et al. Mimo-embodied: X-embodied foundation model technical report. arXiv preprint arXiv:2511.16518, 2025

Pith/arXiv arXiv 2025
[30]

Unify robot actions in camera frame, 2026

Sicheng Xie, Lingchen Meng, Zĳie Diao, Haidong Cao, Zhiying Du, Shuyuan T u, Jiaqi Leng, Qiuyue Wang, Mingsheng Li, Shuai Bai, Zuxuan Wu, and Yu-Gang Jiang. Unify robot actions in camera frame, 2026. URL https://arxiv.org/abs/2511.17001

Pith/arXiv arXiv 2026
[31]

Paligemma: A versatile 3b vlm for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov , Xiao Wang, Daniel Salz, Maxim Neu- mann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. 11

Pith/arXiv arXiv 2024
[32]

Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025
[33]

Libero: Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems , 36:44776–44791, 2023

2023
[34]

Evaluating real-world robot manipulation policies in simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024

Pith/arXiv arXiv 2024
[35]

Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks

Shiduo Zhang, Zhe Xu, Peĳu Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142– 11152, 2025

2025
[36]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, T ony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023

2023
[37]

Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models

Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, and Lin Shao. Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models. Advances in Neural Information Processing Systems, 38:136705–136736, 2026

2026
[38]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European conference on computer vision , pages 146–162. Springer, 2022

2022
[39]

Microsoft coco: Common objects in context

Tsung- Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision , pages 740–755. Springer, 2014

2014
[40]

COL", positioned in the center-right of the scene, to the left of the

Lukas Blecher. LaTeX-OCR: pix2tex – using a ViT to convert images of equations into LaTeX code. https://github. com/lukas-blecher/LaTeX-OCR , 2022. Software repository , accessed 2026-05-28. 12 A Implementation Details A.1 Co-training Strategy During both Objective Bridging and Retentive Adaptation, we co-train on ETC and action data. Each opti- mization ...

2022
[41]

bike” → “bicycle

The continuous score is exp(− RMSE/ 𝜏) with 𝜏 = 20 px; the example is correct if RMSE ≤ 20 px. D.1.6 COCO Joint Detection F1 For COCO, we use a dataset-specific metric that jointly requires category-label agreement and BBox overlap. Class names are normalized to the 80 COCO categories (e.g., “bike” → “bicycle”). Predictions and ground- truth objects are m...

[1] [1]

Gemini robotics: Bring- ing ai into the physical world

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzĳl, et al. Gemini robotics: Bring- ing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

Pith/arXiv arXiv 2025

[2] [2]

Embodiedmidtrain: Bridging the gap between vision-language models and vision-language-action models via mid-training

Yiyang Du, Zhanqiu Guo, Xin Ye, Liu Ren, and Chenyan Xiong. Embodiedmidtrain: Bridging the gap between vision-language models and vision-language-action models via mid-training. arXiv preprint arXiv:2604.20012, 2026

Pith/arXiv arXiv 2026

[3] [3]

Igniting vlms toward the embodied space

Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766, 2025

arXiv 2025

[4] [4]

Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution

Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, et al. Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution. arXiv preprint arXiv:2602.12684, 2026

arXiv 2026

[5] [5]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

2023

[6] [6]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Haus- man, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pe...

Pith/arXiv arXiv 2025

[7] [7]

A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation

Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation. arXiv preprint arXiv:2602.01067, 2026

arXiv 2026

[8] [8]

Chatvla: Unified multimodal understanding and robot control with vision-language-action model

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages 5377–5395, 2025

2025

[9] [9]

Palm-e: An embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery , Brian Ichter, Ayzaan Wahid, Jonathan T ompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

Pith/arXiv arXiv 2023

[10] [10]

Robobrain: A unified brain model for robotic manipulation from abstract to concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 1724–1734, 2025

2025

[11] [11]

Eo-1: An open unified embodied foundation model for general robot control, 2026

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Dong Wang, Xinyi Ye, Qi Lv , Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, and Xuelong Li. Eo-1: An open unified embodied foundation model for general robot control, 2026. URL https://arxiv.org/abs/2508.21112

arXiv 2026

[12] [12]

Gr-3 technical report

Chilam Cheang, Sĳin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report. arXiv preprint arXiv:2507.15493, 2025

Pith/arXiv arXiv 2025

[13] [13]

Galaxea open-world dataset and g0 dual-system vla model

Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576, 2025

arXiv 2025

[14] [14]

Knowledge insulating vision-language-action models: Train fast, run fast, gener- alize better

Danny Driess, Jost Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, gener- alize better. Advances in Neural Information Processing Systems, 38:102867–102888, 2026

2026

[15] [15]

A pragmatic vla foundation model

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026. 10

Pith/arXiv arXiv 2026

[16] [16]

𝜋0: A vision-language-action flow model for general robot control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 𝜋0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[17] [17]

Gr00t n1: An open foundation model for generalist humanoid robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev , Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025

[18] [18]

Faster: T oward efficient autoregressive vision language action modeling via neural action tokenization

Yicheng Liu, Shiduo Zhang, Zibin Dong, Baĳun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, et al. Faster: T oward efficient autoregressive vision language action modeling via neural action tokenization. arXiv preprint arXiv:2512.04952, 2025

arXiv 2025

[19] [19]

Vlm4vla: Revisiting vision-language-models in vision-language-action models

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026

Pith/arXiv arXiv 2026

[20] [20]

Actions as language: Fine- tuning vlms into vlas without catastrophic forgetting

Asher J Hancock, Xindi Wu, Lihan Zha, Olga Russakovsky , and Anirudha Majumdar. Actions as language: Fine- tuning vlms into vlas without catastrophic forgetting. arXiv preprint arXiv:2509.22195, 2025

arXiv 2025

[21] [21]

Internvla-a1: Unifying understanding, generation and action for robotic manipulation

Junhao Cai, Zetao Cai, Jiafei Cao, Yilun Chen, Zeyu He, Lei Jiang, Hang Li, Hengjie Li, Yang Li, Yufei Liu, et al. Internvla-a1: Unifying understanding, generation and action for robotic manipulation. arXiv preprint arXiv:2601.02456, 2026

arXiv 2026

[22] [22]

Sanketi, and Ken Goldberg

Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Pannag R. Sanketi, and Ken Goldberg. Robo2VLM: Visual question answering from large-scale in-the-wild robot manipulation datasets. arXiv preprint arXiv:2505.15517, 2025

arXiv 2025

[23] [23]

Robovqa: Multimodal long-horizon reasoning for robotics

Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 645–652. IEEE, 2024

2024

[24] [24]

Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 2: Short Papers), pages 346–355, 2024

2024

[25] [25]

Roborefer: T owards spatial referring with reasoning in vision-language models for robotics

Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: T owards spatial referring with reasoning in vision-language models for robotics. Advances in Neural Information Processing Systems, 38:28404–28481, 2026

2026

[26] [26]

Robopoint: A vision-language model for spatial affordance prediction for robotics

Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay , Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousa- vian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024

arXiv 2024

[27] [27]

Molmoact: Action reasoning models that can reason in space

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917 , 2025

Pith/arXiv arXiv 2025

[28] [28]

Robobrain 2.0 technical report

BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029, 2025

arXiv 2025

[29] [29]

Mimo-embodied: X-embodied foundation model technical report

Xiaoshuai Hao, Lei Zhou, Zhĳian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, et al. Mimo-embodied: X-embodied foundation model technical report. arXiv preprint arXiv:2511.16518, 2025

Pith/arXiv arXiv 2025

[30] [30]

Unify robot actions in camera frame, 2026

Sicheng Xie, Lingchen Meng, Zĳie Diao, Haidong Cao, Zhiying Du, Shuyuan T u, Jiaqi Leng, Qiuyue Wang, Mingsheng Li, Shuai Bai, Zuxuan Wu, and Yu-Gang Jiang. Unify robot actions in camera frame, 2026. URL https://arxiv.org/abs/2511.17001

Pith/arXiv arXiv 2026

[31] [31]

Paligemma: A versatile 3b vlm for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov , Xiao Wang, Daniel Salz, Maxim Neu- mann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. 11

Pith/arXiv arXiv 2024

[32] [32]

Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025

[33] [33]

Libero: Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems , 36:44776–44791, 2023

2023

[34] [34]

Evaluating real-world robot manipulation policies in simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024

Pith/arXiv arXiv 2024

[35] [35]

Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks

Shiduo Zhang, Zhe Xu, Peĳu Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142– 11152, 2025

2025

[36] [36]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, T ony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023

2023

[37] [37]

Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models

Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, and Lin Shao. Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models. Advances in Neural Information Processing Systems, 38:136705–136736, 2026

2026

[38] [38]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European conference on computer vision , pages 146–162. Springer, 2022

2022

[39] [39]

Microsoft coco: Common objects in context

Tsung- Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision , pages 740–755. Springer, 2014

2014

[40] [40]

COL", positioned in the center-right of the scene, to the left of the

Lukas Blecher. LaTeX-OCR: pix2tex – using a ViT to convert images of equations into LaTeX code. https://github. com/lukas-blecher/LaTeX-OCR , 2022. Software repository , accessed 2026-05-28. 12 A Implementation Details A.1 Co-training Strategy During both Objective Bridging and Retentive Adaptation, we co-train on ETC and action data. Each opti- mization ...

2022

[41] [41]

bike” → “bicycle

The continuous score is exp(− RMSE/ 𝜏) with 𝜏 = 20 px; the example is correct if RMSE ≤ 20 px. D.1.6 COCO Joint Detection F1 For COCO, we use a dataset-specific metric that jointly requires category-label agreement and BBox overlap. Class names are normalized to the 80 COCO categories (e.g., “bike” → “bicycle”). Predictions and ground- truth objects are m...