{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:3TM73BSXD4H7I5353NPVHUX6HM","short_pith_number":"pith:3TM73BSX","schema_version":"1.0","canonical_sha256":"dcd9fd86571f0ff4777ddb5f53d2fe3b0b829472527471f8198f0dc3a6c6dc06","source":{"kind":"arxiv","id":"2507.01925","version":1},"attestation_state":"computed","paper":{"title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Vision-language-action models unify under one framework of action token chains from inputs to actions.","cross_cats":[],"primary_cat":"cs.RO","authors_text":"Fengshuo Bai, Ka Nam Lui, Shaofei Cai, Shaoyang Guo, Tianrui Guan, Xiaowei Zhang, Xuchuan Huang, Yaodong Yang, Yifan Zhong, Yitao Liang, Yuanfei Wang, Yuanpei Chen, Zhang Chen, Zhiquan Qi","submitted_at":"2025-07-02T17:34:52Z","abstract_excerpt":"The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation has sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of \\textit{action tokens} that progressively encode more grounded and actionable information, ultimately generating executable actio"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2507.01925","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.RO","submitted_at":"2025-07-02T17:34:52Z","cross_cats_sorted":[],"title_canon_sha256":"55de44bf23e2520adab3a5805a62d82f2eff96566380ccc5cd38c9d8b684069c","abstract_canon_sha256":"1684ec07c21257a1da9c84eae86ef835a4a06bedfdb53bc256ef53935533bb40"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.882664Z","signature_b64":"OvcGtFipwjmSi8JsmEWfXVZXeymmBhJ+WyRVPlDBuGbIn0Dx+OIg2kqfPOax7nhwI9ttVJ6B6VmOk6MXEjVGDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"dcd9fd86571f0ff4777ddb5f53d2fe3b0b829472527471f8198f0dc3a6c6dc06","last_reissued_at":"2026-05-17T23:38:13.882054Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.882054Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Vision-language-action models unify under one framework of action token chains from inputs to actions.","cross_cats":[],"primary_cat":"cs.RO","authors_text":"Fengshuo Bai, Ka Nam Lui, Shaofei Cai, Shaoyang Guo, Tianrui Guan, Xiaowei Zhang, Xuchuan Huang, Yaodong Yang, Yifan Zhong, Yitao Liang, Yuanfei Wang, Yuanpei Chen, Zhang Chen, Zhiquan Qi","submitted_at":"2025-07-02T17:34:52Z","abstract_excerpt":"The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation has sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of \\textit{action tokens} that progressively encode more grounded and actionable information, ultimately generating executable actio"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Vision-language-action models unify under one framework of action token chains from inputs to actions.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c1d5ce7bc9f2c209b1339bde67ff1b95b72b8f2acd30708e21fed6255273950f"},"source":{"id":"2507.01925","kind":"arxiv","version":1},"verdict":{"id":"cbc52f2c-8685-4f10-98e5-450723c387a3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T14:03:08.728329Z","strongest_claim":"current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions.","one_line_summary":"The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning.","pith_extraction_headline":"Vision-language-action models unify under one framework of action token chains from inputs to actions."},"references":{"count":299,"sample":[{"doi":"","year":2021,"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","ref_index":1,"cited_arxiv_id":"2108.07258","is_internal_anchor":true},{"doi":"","year":2024,"title":"A comprehensive survey on pretrained foundation models: A history from bert to chatgpt","work_id":"cf38f9f9-4530-44e3-a1a3-9e3841a1ac24","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":3,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2025,"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","ref_index":4,"cited_arxiv_id":"2501.12948","is_internal_anchor":true},{"doi":"","year":2021,"title":"Learning transferable visual models from natural language supervision","work_id":"ad3e05b3-af3a-4fa2-ab30-c45f9f403277","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":299,"snapshot_sha256":"28775c9eead06585aa56a604600a68e33b5b450b8b035a869373ca9c434fdc69","internal_anchors":58},"formal_canon":{"evidence_count":2,"snapshot_sha256":"289b1af81091350dd6419362f7714d44788e27c556f0fae5291756dac588ec8a"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2507.01925","created_at":"2026-05-17T23:38:13.882175+00:00"},{"alias_kind":"arxiv_version","alias_value":"2507.01925v1","created_at":"2026-05-17T23:38:13.882175+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2507.01925","created_at":"2026-05-17T23:38:13.882175+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":19,"internal_anchor_count":19,"sample":[{"citing_arxiv_id":"2511.15669","citing_title":"DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2512.16811","citing_title":"GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2602.05765","citing_title":"RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2602.20309","citing_title":"QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2603.12510","citing_title":"Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2509.09674","citing_title":"SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13276","citing_title":"D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09651","citing_title":"FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13276","citing_title":"D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04974","citing_title":"From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data","ref_index":111,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02600","citing_title":"CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00321","citing_title":"Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21241","citing_title":"CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11572","citing_title":"DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07288","citing_title":"Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04834","citing_title":"E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06932","citing_title":"Towards Multi-Object Nonprehensile Transportation via Shared Teleoperation: A Framework Based on Virtual Object Model Predictive Control","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16677","citing_title":"ReconVLA: An Uncertainty-Guided and Failure-Aware Vision-Language-Action Framework for Robotic Control","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02600","citing_title":"CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation","ref_index":34,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3TM73BSXD4H7I5353NPVHUX6HM","json":"https://pith.science/pith/3TM73BSXD4H7I5353NPVHUX6HM.json","graph_json":"https://pith.science/api/pith-number/3TM73BSXD4H7I5353NPVHUX6HM/graph.json","events_json":"https://pith.science/api/pith-number/3TM73BSXD4H7I5353NPVHUX6HM/events.json","paper":"https://pith.science/paper/3TM73BSX"},"agent_actions":{"view_html":"https://pith.science/pith/3TM73BSXD4H7I5353NPVHUX6HM","download_json":"https://pith.science/pith/3TM73BSXD4H7I5353NPVHUX6HM.json","view_paper":"https://pith.science/paper/3TM73BSX","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2507.01925&json=true","fetch_graph":"https://pith.science/api/pith-number/3TM73BSXD4H7I5353NPVHUX6HM/graph.json","fetch_events":"https://pith.science/api/pith-number/3TM73BSXD4H7I5353NPVHUX6HM/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3TM73BSXD4H7I5353NPVHUX6HM/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3TM73BSXD4H7I5353NPVHUX6HM/action/storage_attestation","attest_author":"https://pith.science/pith/3TM73BSXD4H7I5353NPVHUX6HM/action/author_attestation","sign_citation":"https://pith.science/pith/3TM73BSXD4H7I5353NPVHUX6HM/action/citation_signature","submit_replication":"https://pith.science/pith/3TM73BSXD4H7I5353NPVHUX6HM/action/replication_record"}},"created_at":"2026-05-17T23:38:13.882175+00:00","updated_at":"2026-05-17T23:38:13.882175+00:00"}