hub Canonical reference

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen · 2025 · cs.RO · arXiv 2508.13073

Canonical reference. 75% of citing Pith papers cite this work as background.

29 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 29 citing papers arXiv PDF

abstract

Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 baseline 1 method 1

citation-polarity summary

background 6 baseline 1 use method 1

representative citing papers

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

cs.RO · 2026-05-18 · unverdicted · novelty 7.0

ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum robots.

Dynamic Execution Commitment of Vision-Language-Action Models

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 3 refs

A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

cs.RO · 2026-04-19 · unverdicted · novelty 7.0

GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

PolicyTrim is an RL post-training framework that boosts VLA policy efficiency by 3x chunk utilization and 51.4% fewer steps, yielding up to 5.83x speedup.

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

cs.RO · 2026-06-11 · unverdicted · novelty 6.0

GRASP maps natural language to bounding-box goals via VLM for neuro-symbolic planning and reports 73.3% success in 90 real-robot trials without task-specific training.

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

cs.RO · 2026-05-06 · unverdicted · novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing

cs.RO · 2026-05-05 · unverdicted · novelty 6.0

ScanHD achieves 92.7% exact accuracy and 98.1% Win@1 accuracy in recommending discrete scanning parameters from instructions and images on a new real-world dataset.

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

cs.AI · 2026-04-12 · unverdicted · novelty 6.0 · 2 refs

TorchUMM is the first unified codebase and benchmark suite for multimodal understanding, generation, and editing across varied UMM models and datasets.

FASTER: Rethinking Real-Time Flow VLAs

cs.RO · 2026-03-19 · unverdicted · novelty 6.0 · 2 refs

FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

cs.RO · 2026-02-22 · unverdicted · novelty 6.0

OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.

Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

cs.CV · 2025-11-21 · conditional · novelty 6.0

VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.

ESC: Emotional Self-Correction for Reliable Vision-Language Models

cs.CV · 2026-07-01 · unverdicted · novelty 5.0

ESC uses emotional cues triggered by an external verifier to enable training-free self-correction in VLMs, improving reliability on safety, hallucination, and reasoning benchmarks.

GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors

cs.RO · 2026-06-07 · unverdicted · novelty 5.0

GraspFoM creates a shared 3D latent from SAM3D priors, adds an anchor-initialized diffuser for multimodal grasps, and uses reconstruction-aware scoring plus residual updates to jointly achieve SOTA reconstruction and grasping with few extra parameters.

World Models for Robotic Manipulation: A Survey

cs.RO · 2026-05-27 · accept · novelty 5.0

Survey organizing world models for robotic manipulation into representation families, a functional taxonomy, and infrastructure roles across pretraining, post-training, and inference, while reviewing 34 datasets and evaluation protocols.

VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-BasedData Synthesis

cs.RO · 2026-05-16 · unverdicted · novelty 5.0

VLAMotor exposes VLA failures via distance-aware uncertainty testing and synthesizes agent-planned repair data to fine-tune models, reporting 49.25% success rate gains in simulation and 57.5% on hardware.

A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory

cs.RO · 2026-05-04 · unverdicted · novelty 5.0

The Semantic Autonomy Stack combines a seven-step parametric resolver handling 88% of instructions in under 0.1 ms with VLM escalation and a five-category cross-robot memory system, achieving 100% accuracy and 103,000-fold latency reduction on Raspberry Pi 5 robots with no GPU or training data.

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

cs.CV · 2026-04-15 · unverdicted · novelty 5.0 · 2 refs

HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development

cs.RO · 2026-04-15 · unverdicted · novelty 5.0

EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.

Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization

cs.RO · 2026-04-15 · unverdicted · novelty 5.0

EEAgent with LSTRO sets new state-of-the-art results on six VIMA-Bench robotic manipulation tasks by dynamically refining prompts through reflection on successes and failures.

CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

cs.RO · 2026-04-07 · unverdicted · novelty 5.0

CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.

Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

cs.AI · 2025-12-29 · unverdicted · novelty 5.0

A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.

citing papers explorer

Showing 29 of 29 citing papers.

Point Tracking Improves World Action Models cs.RO · 2026-05-22 · unverdicted · none · ref 5 · internal anchor
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics cs.RO · 2026-05-18 · unverdicted · none · ref 6 · internal anchor
ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum robots.
Dynamic Execution Commitment of Vision-Language-Action Models cs.CV · 2026-05-12 · unverdicted · none · ref 6 · 3 links · internal anchor
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies cs.CV · 2026-04-27 · unverdicted · none · ref 39 · internal anchor
CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning cs.RO · 2026-04-19 · unverdicted · none · ref 17 · internal anchor
GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models cs.CV · 2026-06-21 · unverdicted · none · ref 35 · internal anchor
PolicyTrim is an RL post-training framework that boosts VLA policy efficiency by 3x chunk utilization and 51.4% fewer steps, yielding up to 5.83x speedup.
Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning cs.RO · 2026-06-11 · unverdicted · none · ref 17 · internal anchor
GRASP maps natural language to bounding-box goals via VLM for neuro-symbolic planning and reports 73.3% success in 90 real-robot trials without task-specific training.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data cs.RO · 2026-05-13 · unverdicted · none · ref 73 · internal anchor
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 64 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing cs.RO · 2026-05-05 · unverdicted · none · ref 26 · internal anchor
ScanHD achieves 92.7% exact accuracy and 98.1% Win@1 accuracy in recommending discrete scanning parameters from instructions and images on a new real-world dataset.
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training cs.AI · 2026-04-12 · unverdicted · none · ref 20 · 2 links · internal anchor
TorchUMM is the first unified codebase and benchmark suite for multimodal understanding, generation, and editing across varied UMM models and datasets.
FASTER: Rethinking Real-Time Flow VLAs cs.RO · 2026-03-19 · unverdicted · none · ref 70 · 2 links · internal anchor
FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation cs.RO · 2026-02-22 · unverdicted · none · ref 41 · internal anchor
OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions cs.CV · 2025-11-21 · conditional · none · ref 19 · internal anchor
VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.
ESC: Emotional Self-Correction for Reliable Vision-Language Models cs.CV · 2026-07-01 · unverdicted · none · ref 80 · internal anchor
ESC uses emotional cues triggered by an external verifier to enable training-free self-correction in VLMs, improving reliability on safety, hallucination, and reasoning benchmarks.
GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors cs.RO · 2026-06-07 · unverdicted · none · ref 4 · internal anchor
GraspFoM creates a shared 3D latent from SAM3D priors, adds an anchor-initialized diffuser for multimodal grasps, and uses reconstruction-aware scoring plus residual updates to jointly achieve SOTA reconstruction and grasping with few extra parameters.
World Models for Robotic Manipulation: A Survey cs.RO · 2026-05-27 · accept · none · ref 22 · internal anchor
Survey organizing world models for robotic manipulation into representation families, a functional taxonomy, and infrastructure roles across pretraining, post-training, and inference, while reviewing 34 datasets and evaluation protocols.
VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-BasedData Synthesis cs.RO · 2026-05-16 · unverdicted · none · ref 18 · internal anchor
VLAMotor exposes VLA failures via distance-aware uncertainty testing and synthesizes agent-planned repair data to fine-tune models, reporting 49.25% success rate gains in simulation and 57.5% on hardware.
A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory cs.RO · 2026-05-04 · unverdicted · none · ref 8 · internal anchor
The Semantic Autonomy Stack combines a seven-step parametric resolver handling 88% of instructions in under 0.1 ms with VLM escalation and a five-category cross-robot memory system, achieving 100% accuracy and 103,000-fold latency reduction on Raspberry Pi 5 robots with no GPU or training data.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System cs.CV · 2026-04-15 · unverdicted · none · ref 30 · 2 links · internal anchor
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development cs.RO · 2026-04-15 · unverdicted · none · ref 4 · internal anchor
EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.
Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization cs.RO · 2026-04-15 · unverdicted · none · ref 21 · internal anchor
EEAgent with LSTRO sets new state-of-the-art results on six VIMA-Bench robotic manipulation tasks by dynamically refining prompts through reflection on successes and failures.
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment cs.RO · 2026-04-07 · unverdicted · none · ref 47 · internal anchor
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.
Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control cs.AI · 2025-12-29 · unverdicted · none · ref 37 · internal anchor
A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.
RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation cs.RO · 2025-10-20 · unverdicted · none · ref 17 · internal anchor
RESample uses exploratory sampling guided by a lightweight Coverage Function to expand VLA training data coverage, yielding 12% performance gains on LIBERO and real-world tasks with 10-20% added samples.
From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model cs.CV · 2026-05-21 · unverdicted · none · ref 19 · 2 links · internal anchor
BehaviorVLA learns long-horizon behavioral representations via causal Mamba encoder and phase-conditioned decoder, reporting SOTA results of 58% on RoboTwin 2.0, 98% on LIBERO, 4.36 on CALVIN, and matching OpenVLA-OFT performance with 50% data in sim-to-real transfer.
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments cs.CV · 2026-04-20 · unverdicted · none · ref 85 · internal anchor
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines cs.RO · 2026-04-24 · unverdicted · none · ref 19 · internal anchor
A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data cs.RO · 2026-04-04 · unreviewed · ref 82 · internal anchor

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer