Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation
citing papers explorer
- Point Tracking Improves World Action Models
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
- ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics
ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum robots.
- Dynamic Execution Commitment of Vision-Language-Action Models
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
- CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
- GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
- PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models
PolicyTrim is an RL post-training framework that improves VLA policy efficiency, reporting 3x chunk utilization, 51.4% fewer steps, and up to a 5.83x speedup.
- Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning
GRASP maps natural language to bounding-box goals via VLM for neuro-symbolic planning and reports 73.3% success in 90 real-robot trials without task-specific training.
- RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
- ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
- Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing
ScanHD achieves 92.7% exact accuracy and 98.1% Win@1 accuracy in recommending discrete scanning parameters from instructions and images on a new real-world dataset.
- TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
TorchUMM is the first unified codebase and benchmark suite for multimodal understanding, generation, and editing across varied UMM models and datasets.
- FASTER: Rethinking Real-Time Flow VLAs
FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.
- Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.
- Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.
- ESC: Emotional Self-Correction for Reliable Vision-Language Models
ESC uses emotional cues triggered by an external verifier to enable training-free self-correction in VLMs, improving reliability on safety, hallucination, and reasoning benchmarks.
- GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors
GraspFoM creates a shared 3D latent from SAM3D priors, adds an anchor-initialized diffuser for multimodal grasps, and uses reconstruction-aware scoring plus residual updates to jointly achieve SOTA reconstruction and grasping with few extra parameters.
- World Models for Robotic Manipulation: A Survey
Survey organizing world models for robotic manipulation into representation families, a functional taxonomy, and infrastructure roles across pretraining, post-training, and inference, while reviewing 34 datasets and evaluation protocols.
- VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-Based Data Synthesis
VLAMotor exposes VLA failures via distance-aware uncertainty testing and synthesizes agent-planned repair data to fine-tune models, reporting 49.25% success rate gains in simulation and 57.5% on hardware.
- A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory
The Semantic Autonomy Stack combines a seven-step parametric resolver handling 88% of instructions in under 0.1 ms with VLM escalation and a five-category cross-robot memory system, achieving 100% accuracy and 103,000-fold latency reduction on Raspberry Pi 5 robots with no GPU or training data.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
- EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development
EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.
- Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization
EEAgent with LSTRO sets new state-of-the-art results on six VIMA-Bench robotic manipulation tasks by dynamically refining prompts through reflection on successes and failures.
- CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.
- Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control
A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.
- RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation
RESample uses exploratory sampling guided by a lightweight Coverage Function to expand VLA training data coverage, yielding 12% performance gains on LIBERO and real-world tasks with 10-20% added samples.
- From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
BehaviorVLA learns long-horizon behavioral representations via causal Mamba encoder and phase-conditioned decoder, reporting SOTA results of 58% on RoboTwin 2.0, 98% on LIBERO, 4.36 on CALVIN, and matching OpenVLA-OFT performance with 50% data in sim-to-real transfer.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
- Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.
- From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data