{"paper":{"title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Large VLM-based VLA models for robotic manipulation can be systematically classified into monolithic and hierarchical architectures.","cross_cats":["cs.CV"],"primary_cat":"cs.RO","authors_text":"Lingsen Zhang, Liqiang Nie, Ran Chen, Renshan Zhang, Rui Shao, Wei Li, Zhiyang Liu","submitted_at":"2025-08-18T16:45:48Z","abstract_excerpt":"Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delin"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation, resolving inconsistencies in existing taxonomies and filling a critical gap.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the proposed split into monolithic (single/dual-system) and hierarchical models, along with the listed integration domains, comprehensively captures the field without significant omissions or overlaps that would undermine the taxonomy's utility.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Large VLM-based VLA models for robotic manipulation can be systematically classified into monolithic and hierarchical architectures.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"fc66631ce0b964cbb57721970264499a773be95e5a38efcc2c61773771a66a60"},"source":{"id":"2508.13073","kind":"arxiv","version":2},"verdict":{"id":"3fa5fe66-7bb3-4295-9706-b07fc0eb4a08","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T20:24:02.001566Z","strongest_claim":"This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation, resolving inconsistencies in existing taxonomies and filling a critical gap.","one_line_summary":"This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the proposed split into monolithic (single/dual-system) and hierarchical models, along with the listed integration domains, comprehensively captures the field without significant omissions or overlaps that would undermine the taxonomy's utility.","pith_extraction_headline":"Large VLM-based VLA models for robotic manipulation can be systematically classified into monolithic and hierarchical architectures."},"references":{"count":243,"sample":[{"doi":"","year":2024,"title":"//arxiv.org/abs/2402.02385","work_id":"6ae9f09f-8664-4f8e-835f-07d5a9665e6c","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"A Survey on Vision-Language-Action Models for Embodied AI","work_id":"9492fb3d-d667-4892-81bb-b2878f12ff0c","ref_index":2,"cited_arxiv_id":"2405.14093","is_internal_anchor":true},{"doi":"","year":2025,"title":"Metaurban: a simulation platform for embodied ai in urban spaces,","work_id":"c19134ea-10c9-4a07-bc0f-46756c4e9aa0","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Generative artificial intelligence in robotic manipulation: A survey","work_id":"a71401c0-7d08-4253-9d71-0cb28f5db313","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Aligning cyber space with physical world: A comprehensive survey on embodied AI","work_id":"6546762e-7a35-47a7-8489-4ef27dc923a6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":243,"snapshot_sha256":"9893ec419f70645d4cfbbb3b72b29a947bfb5ff2e6af35b6a3d1bffb4c0947ca","internal_anchors":28},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}