hub Canonical reference

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen · 2025 · cs.RO · arXiv 2508.13073

Canonical reference. 75% of citing Pith papers cite this work as background.

23 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 23 citing papers arXiv PDF

abstract

Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 baseline 1 method 1

citation-polarity summary

background 6 baseline 1 use method 1

representative citing papers

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

cs.RO · 2026-05-18 · unverdicted · novelty 7.0

ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum robots.

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

cs.RO · 2026-04-19 · unverdicted · novelty 7.0

GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

cs.RO · 2026-05-06 · unverdicted · novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing

cs.RO · 2026-05-05 · unverdicted · novelty 6.0

ScanHD achieves 92.7% exact accuracy and 98.1% Win@1 accuracy in recommending discrete scanning parameters from instructions and images on a new real-world dataset.

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

cs.AI · 2026-04-12 · unverdicted · novelty 6.0 · 2 refs

TorchUMM is the first unified codebase and benchmark suite for multimodal understanding, generation, and editing across varied UMM models and datasets.

FASTER: Rethinking Real-Time Flow VLAs

cs.RO · 2026-03-19 · unverdicted · novelty 6.0 · 2 refs

FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

cs.RO · 2026-02-22 · unverdicted · novelty 6.0

OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.

Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

cs.CV · 2025-11-21 · conditional · novelty 6.0

VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.

Dynamic Execution Commitment of Vision-Language-Action Models

cs.CV · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

A3 adaptively selects verifiable action prefixes in VLA models using group-sampled consensus and conditional re-decoding to balance robustness and speed without manual horizon tuning.

A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory

cs.RO · 2026-05-04 · unverdicted · novelty 5.0

The Semantic Autonomy Stack combines a seven-step parametric resolver handling 88% of instructions in under 0.1 ms with VLM escalation and a five-category cross-robot memory system, achieving 100% accuracy and 103,000-fold latency reduction on Raspberry Pi 5 robots with no GPU or training data.

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

cs.CV · 2026-04-15 · unverdicted · novelty 5.0 · 2 refs

HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development

cs.RO · 2026-04-15 · unverdicted · novelty 5.0

EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.

Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization

cs.RO · 2026-04-15 · unverdicted · novelty 5.0

EEAgent with LSTRO sets new state-of-the-art results on six VIMA-Bench robotic manipulation tasks by dynamically refining prompts through reflection on successes and failures.

CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

cs.RO · 2026-04-07 · unverdicted · novelty 5.0

CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.

Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

cs.AI · 2025-12-29 · unverdicted · novelty 5.0

A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.

RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation

cs.RO · 2025-10-20 · unverdicted · novelty 5.0

RESample uses exploratory sampling guided by a lightweight Coverage Function to expand VLA training data coverage, yielding 12% performance gains on LIBERO and real-world tasks with 10-20% added samples.

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

cs.CV · 2026-04-20 · unverdicted · novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.

Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

cs.RO · 2026-04-24 · unverdicted · novelty 3.0

A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.

From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

cs.CV · 2026-05-21

From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

cs.RO · 2026-04-04

citing papers explorer

Showing 23 of 23 citing papers.

Point Tracking Improves World Action Models cs.RO · 2026-05-22 · unverdicted · none · ref 5 · internal anchor
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics cs.RO · 2026-05-18 · unverdicted · none · ref 6 · internal anchor
ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum robots.
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies cs.CV · 2026-04-27 · unverdicted · none · ref 39 · internal anchor
CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning cs.RO · 2026-04-19 · unverdicted · none · ref 17 · internal anchor
GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data cs.RO · 2026-05-13 · unverdicted · none · ref 73 · internal anchor
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 64 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing cs.RO · 2026-05-05 · unverdicted · none · ref 26 · internal anchor
ScanHD achieves 92.7% exact accuracy and 98.1% Win@1 accuracy in recommending discrete scanning parameters from instructions and images on a new real-world dataset.
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training cs.AI · 2026-04-12 · unverdicted · none · ref 20 · 2 links · internal anchor
TorchUMM is the first unified codebase and benchmark suite for multimodal understanding, generation, and editing across varied UMM models and datasets.
FASTER: Rethinking Real-Time Flow VLAs cs.RO · 2026-03-19 · unverdicted · none · ref 70 · 2 links · internal anchor
FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation cs.RO · 2026-02-22 · unverdicted · none · ref 41 · internal anchor
OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions cs.CV · 2025-11-21 · conditional · none · ref 19 · internal anchor
VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.
Dynamic Execution Commitment of Vision-Language-Action Models cs.CV · 2026-05-12 · unverdicted · none · ref 6 · 2 links · internal anchor
A3 adaptively selects verifiable action prefixes in VLA models using group-sampled consensus and conditional re-decoding to balance robustness and speed without manual horizon tuning.
A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory cs.RO · 2026-05-04 · unverdicted · none · ref 8 · internal anchor
The Semantic Autonomy Stack combines a seven-step parametric resolver handling 88% of instructions in under 0.1 ms with VLM escalation and a five-category cross-robot memory system, achieving 100% accuracy and 103,000-fold latency reduction on Raspberry Pi 5 robots with no GPU or training data.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System cs.CV · 2026-04-15 · unverdicted · none · ref 30 · 2 links · internal anchor
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development cs.RO · 2026-04-15 · unverdicted · none · ref 4 · internal anchor
EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.
Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization cs.RO · 2026-04-15 · unverdicted · none · ref 21 · internal anchor
EEAgent with LSTRO sets new state-of-the-art results on six VIMA-Bench robotic manipulation tasks by dynamically refining prompts through reflection on successes and failures.
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment cs.RO · 2026-04-07 · unverdicted · none · ref 47 · internal anchor
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.
Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control cs.AI · 2025-12-29 · unverdicted · none · ref 37 · internal anchor
A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.
RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation cs.RO · 2025-10-20 · unverdicted · none · ref 17 · internal anchor
RESample uses exploratory sampling guided by a lightweight Coverage Function to expand VLA training data coverage, yielding 12% performance gains on LIBERO and real-world tasks with 10-20% added samples.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments cs.CV · 2026-04-20 · unverdicted · none · ref 85 · internal anchor
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines cs.RO · 2026-04-24 · unverdicted · none · ref 19 · internal anchor
A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.
From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model cs.CV · 2026-05-21 · unreviewed · ref 19 · internal anchor
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data cs.RO · 2026-04-04 · unreviewed · ref 82 · internal anchor

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer