{"paper":{"title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Vision-language-action models unify under one framework of action token chains from inputs to actions.","cross_cats":[],"primary_cat":"cs.RO","authors_text":"Fengshuo Bai, Ka Nam Lui, Shaofei Cai, Shaoyang Guo, Tianrui Guan, Xiaowei Zhang, Xuchuan Huang, Yaodong Yang, Yifan Zhong, Yitao Liang, Yuanfei Wang, Yuanpei Chen, Zhang Chen, Zhiquan Qi","submitted_at":"2025-07-02T17:34:52Z","abstract_excerpt":"The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation has sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of \\textit{action tokens} that progressively encode more grounded and actionable information, ultimately generating executable actio"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Vision-language-action models unify under one framework of action token chains from inputs to actions.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c1d5ce7bc9f2c209b1339bde67ff1b95b72b8f2acd30708e21fed6255273950f"},"source":{"id":"2507.01925","kind":"arxiv","version":1},"verdict":{"id":"cbc52f2c-8685-4f10-98e5-450723c387a3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T14:03:08.728329Z","strongest_claim":"current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions.","one_line_summary":"The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning.","pith_extraction_headline":"Vision-language-action models unify under one framework of action token chains from inputs to actions."},"references":{"count":299,"sample":[{"doi":"","year":2021,"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","ref_index":1,"cited_arxiv_id":"2108.07258","is_internal_anchor":true},{"doi":"","year":2024,"title":"A comprehensive survey on pretrained foundation models: A history from bert to chatgpt","work_id":"cf38f9f9-4530-44e3-a1a3-9e3841a1ac24","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":3,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2025,"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","ref_index":4,"cited_arxiv_id":"2501.12948","is_internal_anchor":true},{"doi":"","year":2021,"title":"Learning transferable visual models from natural language supervision","work_id":"ad3e05b3-af3a-4fa2-ab30-c45f9f403277","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":299,"snapshot_sha256":"28775c9eead06585aa56a604600a68e33b5b450b8b035a869373ca9c434fdc69","internal_anchors":58},"formal_canon":{"evidence_count":2,"snapshot_sha256":"289b1af81091350dd6419362f7714d44788e27c556f0fae5291756dac588ec8a"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}