{"paper":{"title":"DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control","license":"http://creativecommons.org/licenses/by/4.0/","headline":"DexVLA plugs a billion-parameter diffusion expert pre-trained across robot bodies into vision-language models for language-driven control on new embodiments.","cross_cats":["cs.CV"],"primary_cat":"cs.RO","authors_text":"Chaomin Shen, Feifei Feng, Jinming Li, Junjie Wen, Yichen Zhu, Zhibin Tang","submitted_at":"2025-02-09T11:25:56Z","abstract_excerpt":"Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for comp"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"DexVLA demonstrates superior performance compared to state-of-the-art models like Octo, OpenVLA, and Diffusion Policy across multiple embodiments for complex, long-horizon tasks using only direct language prompting.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That pre-training the diffusion expert on cross-embodiment data produces action representations that transfer effectively when plugged into a new VLA without requiring embodiment-specific action fine-tuning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to 
achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"DexVLA plugs a billion-parameter diffusion expert pre-trained across robot bodies into vision-language models for language-driven control on new embodiments.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4fc13dbc4d70b085e9d17d4f0fba24f3b81c9143670db1e4b6b4019057453d63"},"source":{"id":"2502.05855","kind":"arxiv","version":3},"verdict":{"id":"95b2b460-5def-4f64-9661-eff5a864fc18","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T19:43:39.917656Z","strongest_claim":"DexVLA demonstrates superior performance compared to state-of-the-art models like Octo, OpenVLA, and Diffusion Policy across multiple embodiments for complex, long-horizon tasks using only direct language prompting.","one_line_summary":"DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That pre-training the diffusion expert on cross-embodiment data produces action representations that transfer effectively when plugged into a new VLA without requiring embodiment-specific action fine-tuning.","pith_extraction_headline":"DexVLA plugs a billion-parameter diffusion expert pre-trained across robot bodies into vision-language models for language-driven control on new embodiments."},"references":{"count":72,"sample":[{"doi":"","year":2024,"title":"Learning visuotactile skills with two multifingered 
hands","work_id":"f7dba6c3-e425-4afe-ae0f-70a0a285540b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Robocook: Long-horizon elasto-plastic object manipulation with diverse tools","work_id":"299cb136-afd1-4545-8a31-80f006fb7bfd","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Learning manipulation skills through robot chain-of-thought with sparse failure guidance","work_id":"ee610349-1b86-4b8b-92b0-653f19d91593","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-","work_id":"f6500a83-997c-40b8-bd1a-88398ce208d8","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Y. Qin, Y.-H. Wu, S. Liu, H. Jiang, R. Yang, Y. Fu, and X. Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision, pages 570–587. Sp","work_id":"f62be0a4-a743-4391-a173-89951bc9b2a8","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":72,"snapshot_sha256":"9a29ca6695b1a3ab61d3c8427522dc21496315288edf40d42f5073469b1d0381","internal_anchors":25},"formal_canon":{"evidence_count":2,"snapshot_sha256":"574ccfec6c4dfa9f48901017658c17d4887dc99cd0a3dcea92437c7b5a7ada3e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}