Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Anzhe Chen; Chenxu Lü; Dayiheng Liu; Delin Chen; Gengze Zhou; Hang Yin; Haoqi Yuan; Haoyang Li; Jian Guan; Jiazhao Zhang

arXiv: 2605.30280 · v2 · pith:IDRFEQHOnew · submitted 2026-05-28 · cs.RO · cs.AI · cs.CL

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen

classification: cs.RO, cs.AI, cs.CL

keywords: manipulation, robot, data, navigation, across, action, embodied, environments

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LA4VLA: Learning to Act without Seeing via Language-Action Pretraining
cs.RO 2026-06 unverdicted novelty 6.0

LA4VLA pretrains on language-action pairs from decomposed demonstrations to create reusable action priors, yielding up to 45 percentage point gains in real-world VLA success rates when mixed with standard training.

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
cs.CV 2026-06 unverdicted novelty 6.0

ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.