Rethinking VLM Representation for VLA Initialization

Hao Li; Hongsheng Li; Jianbo Liu; Ruichuan An; Siyuan Huang; Tingwei Chen; Weifeng Lin; Xinyu Wei

arxiv: 2605.25802 · v1 · pith:ADW757Y4new · submitted 2026-05-25 · 💻 cs.CV

Rethinking VLM Representation for VLA Initialization

Weifeng Lin , Siyuan Huang , Hao Li , Tingwei Chen , Ruichuan An , Xinyu Wei , Jianbo Liu , Hongsheng Li This is my paper

Pith reviewed 2026-06-29 22:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords Vision-Language-ActionVLM initializationembodied VQALoRA adaptationrobot pretrainingpolicy backbone

0 comments

The pith

The original pretrained VLM representation is the main driver of action performance in VLA models, and targeted adaptations must preserve it rather than reshape it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to turn a pretrained vision-language model into a useful starting point for a vision-language-action policy. It runs controlled tests along three axes: adding embodied visual question answering data at different capability levels, choosing between full parameter updates and low-rank adaptation, and adding robot trajectory pretraining before the main task. Results indicate that the untouched pretrained representation already supplies useful structure for action learning. Embodied VQA gains appear only when they address specific downstream limits and do not combine in simple ways. Low-rank adaptation keeps the original representation more intact than full updates, and the best outcomes come from staged low-rank robot-data pretraining.

Core claim

Experiments demonstrate that the original pretrained VLM representation supplies a key source of action performance. Embodied VQA adaptation produces uneven benefits that hinge on existing bottlenecks in the task and do not add across domains. Low-rank adaptation yields more reliable initialization than full finetuning because aggressive reshaping weakens the representation. Adding robot-data pretraining improves results further, with the strongest initialization obtained through staged low-rank training that injects action-relevant signals while leaving the pretrained features largely unchanged.

What carries the argument

The VLM representation used as VLA initialization, tested through controlled variation along three axes of capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining.

If this is right

VLA training should prioritize methods that keep most of the original VLM weights fixed.
Full parameter updates risk erasing action-relevant structure already present in the pretrained model.
Robot trajectory data added via staged low-rank updates produces stronger initialization than either source alone.
Gains from different embodied supervision domains must be checked against the specific bottlenecks of the target task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar representation-preservation principles may apply when adapting other multimodal models to new output modalities.
Future work could test whether the same staged low-rank pattern holds when the downstream task involves longer-horizon planning or multi-robot coordination.
The finding suggests that VLA designers should measure how much the representation changes during adaptation rather than only tracking task reward.

Load-bearing premise

That varying only embodied VQA supervision, update strategy, and robot-data pretraining is enough to uncover general rules for choosing a VLM representation as VLA starting point.

What would settle it

A replication in which full finetuning on embodied VQA data consistently outperforms staged LoRA on the same downstream action tasks, or in which removing the original pretrained VLM weights improves rather than hurts policy success.

Figures

Figures reproduced from arXiv: 2605.25802 by Hao Li, Hongsheng Li, Jianbo Liu, Ruichuan An, Siyuan Huang, Tingwei Chen, Weifeng Lin, Xinyu Wei.

**Figure 1.** Figure 1: Overview of our study design. We evaluate the resulting initializations across multiple base VLMs, action heads, and simulated benchmarks. Our experiments reveal several significant and counterintuitive patterns. First, the original pretrained VLM representation is a major source of action performance, as policies trained from scratch drop by more than 20% across all benchmarks. Second, embodied VQA adapta… view at source ↗

**Figure 2.** Figure 2: Illustrations of the seven embodied VQA domains and two VLA architectures. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Effect of parameter update strategies (LoRA vs. Full Finetune) on VLA initialization. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Effect of base VLM backbones on adaptation. (a) Average success-rate changes relative to each base VLM’s own baseline. (b) Detailed success rates (%) for the three base VLMs. the auxiliary VQA task more strongly but loses substantially more general VLM capability and downstream VLA performance. These diagnostics strengthen our interpretation that useful VLA initialization depends not only on learning the a… view at source ↗

read the original abstract

Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper runs controlled comparisons on three axes for VLM-to-VLA transfer and finds LoRA plus staged robot pretraining beats full finetune or adding embodied VQA alone.

read the letter

The key point is that the original VLM representation stays useful for action, LoRA is more reliable than full finetuning because it avoids over-reshaping, and robot-data pretraining adds value in stages. Embodied VQA gains are not uniform or additive and depend on where the bottleneck is.

What the paper does is lay out three explicit axes—capability-level embodied VQA, update strategy, and robot pretraining—and report comparative outcomes from experiments. That is a straightforward extension of existing VLM adaptation work in robotics, and the non-additive finding plus the LoRA preference are concrete observations worth noting for practitioners.

The main limitation is that the abstract gives no dataset sizes, model scales, baseline details, or statistical controls, so it is hard to gauge how stable the trends are. The full text might fill this in, but based on what is visible the evidence is empirical and setup-dependent rather than broadly validated.

This is aimed at people already building or tuning VLAs who need practical initialization rules. It is not a new framework or theoretical result, but the experiments are focused enough that a serious referee could check the controls and numbers. I would send it for review rather than desk reject.

Referee Report

0 major / 2 minor

Summary. The paper frames VLA initialization as a controlled representation-design problem and studies it along three axes: capability-level embodied VQA supervision, parameter-update strategy (LoRA vs. full finetune), and robot-data pretraining. Experiments indicate that the original pretrained VLM representation remains a key source of action performance; embodied VQA adaptation yields non-uniform gains that depend on downstream bottlenecks and are not additive across domains; LoRA provides more reliable initialization than full finetuning; and the strongest results come from staged LoRA-based robot-data pretraining.

Significance. If the reported trends hold under the tested conditions, the work supplies concrete empirical guidance on VLM-to-VLA transfer that prioritizes preservation of pretrained representations while selectively injecting embodied and trajectory signals. The three-axis experimental design offers a reusable template for future ablation studies in this area.

minor comments (2)

[Abstract] Abstract: the summary of experimental findings supplies no information on datasets, model sizes, number of runs, or statistical controls, making it difficult for readers to gauge the scope and reliability of the stated trends without immediately consulting the main text.
[Introduction] The weakest-assumption paragraph in the introduction (or equivalent) should explicitly note that the three chosen axes are not claimed to be exhaustive and that interactions with other factors (e.g., action-head architecture, data scale) remain untested.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. The report accurately reflects our experimental findings on the three axes of VLA initialization.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a purely empirical study that reports outcomes from controlled experiments along three explicitly defined axes (embodied VQA supervision, parameter-update strategy, and robot-data pretraining). No equations, fitted parameters, predictions, or derivation chains are present that could reduce any claim to a self-referential construction or self-citation load-bearing step. The central findings are direct summaries of experimental results and remain independent of any internal redefinition or imported uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone.

pith-pipeline@v0.9.1-grok · 5736 in / 1262 out tokens · 44824 ms · 2026-06-29T22:19:35.489700+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 37 canonical work pages · 24 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al

URLhttps://arxiv.org/abs/2511.07403. Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

work page arXiv
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE, 2025a. Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhit...

work page arXiv
[8]

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al

URL https://arxiv.org/abs/ 2505.15517. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?arXiv preprint arXiv:2403.20330, 2024b. An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wan...

work page arXiv
[9]

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

doi:10.48550/arXiv.2505.03912 , abstract =

URLhttps://github.com/starVLA/starVLA. 10 Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, et al. Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation.arXiv preprint arXiv:2505.03912,

work page arXiv
[11]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Knowledge insulating vision-language- action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705,

Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language- action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705,

work page arXiv
[13]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction- aligned 3d reconstruction.arXiv preprint arXiv:2505.20279,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Towards visuospatial cognition via hierarchical fusion of visual experts.arXiv preprint arXiv:2505.12363, 2025a

Qi Feng. Towards visuospatial cognition via hierarchical fusion of visual experts.arXiv preprint arXiv:2505.12363, 2025a. Qi Feng. Towards visuospatial cognition via hierarchical fusion of visual experts, 2025b. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of ...

work page arXiv
[16]

Inner Monologue: Embodied Reasoning through Planning with Language Models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models.arXiv preprint arXiv:2207.05608,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Evaluating Real-World Robot Manipulation Policies in Simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Thinking with camera: A unified multimodal model for camera-centric understanding and generation.arXiv preprint arXiv:2510.08673,

11 Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, and Chen Change Loy. Thinking with camera: A unified multimodal model for camera-centric understanding and generation.arXiv preprint arXiv:2510.08673,

work page arXiv
[22]

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606,

Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606,

work page arXiv
[24]

RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

URL https://arxiv.org/abs/2503.15558. Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models.arXiv preprint arXiv:2310.08864,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Ego- thinker: Unveiling egocentric reasoning with spatio-temporal cot.arXiv preprint arXiv:2510.23569,

Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, and Jiangmiao Pang. Ego- thinker: Unveiling egocentric reasoning with spatio-temporal cot.arXiv preprint arXiv:2510.23569,

work page arXiv
[28]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Eo-1: Interleaved vision-text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112,

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision-text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112,

work page arXiv
[30]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

PaliGemma 2: A Family of Versatile VLMs for Transfer

Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bit- ton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer.arXiv preprint arXiv:2412.03555,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029,

BAAI RoboBrain Team. Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029,

work page arXiv
[33]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montser- rat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Vlaser: Vision-language-action model with synergistic embodied reasoning.arXiv preprint arXiv:2510.11027, 2025a

Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, et al. Vlaser: Vision-language-action model with synergistic embodied reasoning.arXiv preprint arXiv:2510.11027, 2025a. Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqia...

work page arXiv
[36]

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models.arXiv preprint arXiv:2601.03309,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross- embodiment vision-language-action model.arXiv preprint arXiv:2510.10274,

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Roborefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv preprint arXiv:2506.04308,

Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv preprint arXiv:2506.04308,

work page arXiv
[39]

The reported counts denote the available candidate pool after preprocessing. In Sec. 4.1.1, each single-domain model is trained with 800K samples, sampled uniformly across the source datasets within that domain. For multi-domain composition in Sec. 4.1.2, we keep a fixed total budget and sample evenly across the selected domains. Table 4: Stage-1 embodied...

2024
[40]

We use learning rate2×10 −5 for the VLM backbone 14 Table 5: Stage-2 training and evaluation settings for downstream VLA learning. Item Libero-10 SimplerBridge RoboCasa GR1 Tabletop Train Image resolution224×224 224×224 224×224 Optimizer AdamW AdamW AdamW AdamW betas [0.9, 0.999] [0.9, 0.999] [0.9, 0.999] LR 1e-5 5e-5 5e-5 LR schedule cosine with min_lr =...

2025

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [3]

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al

URLhttps://arxiv.org/abs/2511.07403. Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

work page arXiv

[3] [4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [5]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [6]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [7]

Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE, 2025a. Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhit...

work page arXiv

[7] [8]

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al

URL https://arxiv.org/abs/ 2505.15517. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?arXiv preprint arXiv:2403.20330, 2024b. An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wan...

work page arXiv

[8] [9]

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [10]

doi:10.48550/arXiv.2505.03912 , abstract =

URLhttps://github.com/starVLA/starVLA. 10 Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, et al. Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation.arXiv preprint arXiv:2505.03912,

work page arXiv

[10] [11]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [12]

Knowledge insulating vision-language- action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705,

Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language- action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705,

work page arXiv

[12] [13]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction- aligned 3d reconstruction.arXiv preprint arXiv:2505.20279,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [14]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [15]

Towards visuospatial cognition via hierarchical fusion of visual experts.arXiv preprint arXiv:2505.12363, 2025a

Qi Feng. Towards visuospatial cognition via hierarchical fusion of visual experts.arXiv preprint arXiv:2505.12363, 2025a. Qi Feng. Towards visuospatial cognition via hierarchical fusion of visual experts, 2025b. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of ...

work page arXiv

[15] [16]

Inner Monologue: Embodied Reasoning through Planning with Language Models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models.arXiv preprint arXiv:2207.05608,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [17]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [18]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [19]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [20]

Evaluating Real-World Robot Manipulation Policies in Simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [21]

Thinking with camera: A unified multimodal model for camera-centric understanding and generation.arXiv preprint arXiv:2510.08673,

11 Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, and Chen Change Loy. Thinking with camera: A unified multimodal model for camera-centric understanding and generation.arXiv preprint arXiv:2510.08673,

work page arXiv

[21] [22]

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310,

work page internal anchor Pith review Pith/arXiv arXiv

[22] [23]

Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606,

Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606,

work page arXiv

[23] [24]

RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [25]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

URL https://arxiv.org/abs/2503.15558. Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models.arXiv preprint arXiv:2310.08864,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [26]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [27]

Ego- thinker: Unveiling egocentric reasoning with spatio-temporal cot.arXiv preprint arXiv:2510.23569,

Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, and Jiangmiao Pang. Ego- thinker: Unveiling egocentric reasoning with spatio-temporal cot.arXiv preprint arXiv:2510.23569,

work page arXiv

[27] [28]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [29]

Eo-1: Interleaved vision-text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112,

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision-text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112,

work page arXiv

[29] [30]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv

[30] [31]

PaliGemma 2: A Family of Versatile VLMs for Transfer

Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bit- ton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer.arXiv preprint arXiv:2412.03555,

work page internal anchor Pith review Pith/arXiv arXiv

[31] [32]

Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029,

BAAI RoboBrain Team. Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029,

work page arXiv

[32] [33]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montser- rat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020,

work page internal anchor Pith review Pith/arXiv arXiv

[33] [34]

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855,

work page internal anchor Pith review Pith/arXiv arXiv

[34] [35]

Vlaser: Vision-language-action model with synergistic embodied reasoning.arXiv preprint arXiv:2510.11027, 2025a

Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, et al. Vlaser: Vision-language-action model with synergistic embodied reasoning.arXiv preprint arXiv:2510.11027, 2025a. Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqia...

work page arXiv

[35] [36]

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models.arXiv preprint arXiv:2601.03309,

work page internal anchor Pith review Pith/arXiv arXiv

[36] [37]

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross- embodiment vision-language-action model.arXiv preprint arXiv:2510.10274,

work page internal anchor Pith review Pith/arXiv arXiv

[37] [38]

Roborefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv preprint arXiv:2506.04308,

Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv preprint arXiv:2506.04308,

work page arXiv

[38] [39]

The reported counts denote the available candidate pool after preprocessing. In Sec. 4.1.1, each single-domain model is trained with 800K samples, sampled uniformly across the source datasets within that domain. For multi-domain composition in Sec. 4.1.2, we keep a fixed total budget and sample evenly across the selected domains. Table 4: Stage-1 embodied...

2024

[39] [40]

We use learning rate2×10 −5 for the VLM backbone 14 Table 5: Stage-2 training and evaluation settings for downstream VLA learning. Item Libero-10 SimplerBridge RoboCasa GR1 Tabletop Train Image resolution224×224 224×224 224×224 Optimizer AdamW AdamW AdamW AdamW betas [0.9, 0.999] [0.9, 0.999] [0.9, 0.999] LR 1e-5 5e-5 5e-5 LR schedule cosine with min_lr =...

2025