{"work":{"id":"bcbdae00-80a0-4488-a280-7eaf3c5336bb","openalex_id":null,"doi":null,"arxiv_id":"2508.13073","raw_key":null,"title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey","authors":null,"authors_text":"Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen","year":2025,"venue":"cs.RO","abstract":"Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation","external_url":"https://arxiv.org/abs/2508.13073","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T03:56:37.125436+00:00","pith_arxiv_id":"2508.13073","created_at":"2026-05-10T05:51:10.387732+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey","render_title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey"},"hub":{"state":{"work_id":"bcbdae00-80a0-4488-a280-7eaf3c5336bb","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":23,"external_cited_by_count":null,"distinct_field_count":3,"first_pith_cited_at":"2025-10-20T15:21:12+00:00","last_pith_cited_at":"2026-05-22T17:08:37+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-07T17:32:17.729896+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":6},{"context_role":"baseline","n":1},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":6},{"context_polarity":"baseline","n":1},{"context_polarity":"use_method","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}