{"paper":{"title":"RotVLA: Rotational Latent Action for Vision-Language-Action Model","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"RotVLA replaces discrete action codes with continuous rotations in SO(n) for vision-language-action models.","cross_cats":["cs.CV"],"primary_cat":"cs.RO","authors_text":"Hangjun Ye, Jiahuan Zhou, Peiyan Li, Qiwei Li, Quanyun Zhou, Xicheng Gong, Xinghang Li, Yadong Mu","submitted_at":"2026-05-13T11:58:02Z","abstract_excerpt":"Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing conti"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That representing latent actions as elements of SO(n) together with a triplet-frame objective inherently supplies continuity, compositionality, and physically meaningful structure while preventing trivial frame-reconstruction solutions.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"RotVLA replaces discrete action codes with continuous rotations in SO(n) for vision-language-action models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a9c8e1113c85f322833bd3fa93239dafd3ee05470983c42e33923972ba7d27b0"},"source":{"id":"2605.13403","kind":"arxiv","version":1},"verdict":{"id":"7f51c42f-7b3a-4437-a206-a8768d87e6b6","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T17:44:25.288826Z","strongest_claim":"With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.","one_line_summary":"RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That representing latent actions as elements of SO(n) together with a triplet-frame objective inherently supplies continuity, compositionality, and physically meaningful structure while preventing trivial frame-reconstruction solutions.","pith_extraction_headline":"RotVLA replaces discrete action codes with continuous rotations in SO(n) for vision-language-action models."},"references":{"count":86,"sample":[{"doi":"","year":2024,"title":"A Survey on Vision-Language-Action Models for Embodied AI","work_id":"9492fb3d-d667-4892-81bb-b2878f12ff0c","ref_index":1,"cited_arxiv_id":"2405.14093","is_internal_anchor":true},{"doi":"","year":2025,"title":"Roumelio- tis, and Manoj Karkee","work_id":"5f0bf2cc-1901-4ad0-940b-1e742cc6d7e7","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916","work_id":"115823a2-8918-4227-8872-3d0a36ff07a9","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks","work_id":"321b2bd4-950a-44f0-ab50-e70251e75187","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":5,"cited_arxiv_id":"2502.13923","is_internal_anchor":true}],"resolved_work":86,"snapshot_sha256":"114c3a6b5eda750fac0cfaa8fbe9b84492ddf74d37e389105a4d0f5a207e0207","internal_anchors":28},"formal_canon":{"evidence_count":2,"snapshot_sha256":"61af6b48e18519772de59e4d3a20be8a02b289c706a90aad0c87e7dc08b7a177"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}