{"paper":{"title":"Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.RO","authors_text":"Anzhe Chen, Chenxu Lü, Dayiheng Liu, Delin Chen, Gengze Zhou, Hang Yin, Haoqi Yuan, Haoyang Li, Jian Guan, Jiazhao Zhang, Jie Zhang, Jingren Zhou, Jingyang Fan, Jinhui Ye, Junhao Chen, Junyang Lin, Mingsheng Li, Pei Lin, Qiuyue Wang, Ruizhe Chen, Shuai Bai, Sicheng Xie, Tao Yu, Tong Zhang, Wujian Peng, Xianwei Zhuang, Xintong Hu, Xin Zhou, Xionghui Chen, Xuejing Liu, Xuhong Huang, Ye Wang, Yingming Zheng, Yitao Liu, Yiyang Huang, Yuchong Sun, Zhaohai Li, Zhibo Yang, Zhixuan Liang, Zixing Lei","submitted_at":"2026-05-28T17:36:31Z","abstract_excerpt":"Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based a"},"claims":{"count":0,"items":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"2605.30280","kind":"arxiv","version":1},"verdict":{"id":null,"model_set":{},"created_at":null,"strongest_claim":"","one_line_summary":"","pipeline_version":null,"weakest_assumption":"","pith_extraction_headline":""},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.30280/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}