hub Canonical reference

3D-VLA: A 3D Vision-Language-Action Generative World Model

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du · 2024 · cs.CV · arXiv 2403.09631

Canonical reference. 88% of citing Pith papers cite this work as background.

46 Pith papers citing it

Background 88% of classified citations

open full Pith review browse 46 citing papers arXiv PDF

abstract

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 22 baseline 1 method 1

citation-polarity summary

background 21 baseline 1 unclear 1 use method 1

representative citing papers

GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

cs.RO · 2026-05-20 · unverdicted · novelty 7.0

GaussianDream is a feed-forward 3D Gaussian world model that provides structured spatial-temporal supervision for VLA-based robotic manipulation by predicting future Gaussian states during training.

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · conditional · novelty 7.0 · 3 refs

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

Action Images: End-to-End Policy Learning via Multiview Video Generation

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

cs.RO · 2026-02-23 · unverdicted · novelty 7.0

UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

cs.CV · 2025-11-16 · conditional · novelty 7.0

BridgeEQA creates a new benchmark and EMVR method for embodied agents to perform question answering on real-world bridge inspections using egocentric images and professional reports.

Beyond Syntax: Action Semantics Learning for App Agents

cs.AI · 2025-06-21 · unverdicted · novelty 7.0

Action Semantics Learning trains app agents to align with the semantic effects of actions via a Semantic Estimator module, improving robustness to out-of-distribution scenarios over syntax-matching fine-tuning.

VLAs are Confined yet Capable of Generalizing to Novel Instructions

cs.RO · 2025-05-06 · unverdicted · novelty 7.0

Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

cs.RO · 2026-05-06 · unverdicted · novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

Affordance Agent Harness: Verification-Gated Skill Orchestration

cs.RO · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

cs.RO · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

cs.RO · 2026-04-24 · unverdicted · novelty 6.0

A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.

ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

ThermoAct:Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making

cs.RO · 2026-03-26 · unverdicted · novelty 6.0

ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.

Learning Native Continuation for Action Chunking Flow Policies

cs.RO · 2026-02-13 · unverdicted · novelty 6.0

Legato trains flow-based VLA policies with schedule-shaped action-noise mixtures and randomized conditions to achieve smoother trajectories and ~10% faster task completion than real-time chunking across five real-world manipulation tasks.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

cs.RO · 2026-01-11 · unverdicted · novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation

cs.CV · 2025-12-29 · unverdicted · novelty 6.0

GaussianDWM uses 3D Gaussians with embedded linguistic features, language-guided sampling, and dual-condition generation for unified scene understanding and multi-modal output in driving world models.

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

cs.CV · 2025-07-06 · unverdicted · novelty 6.0

DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.

GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation

cs.RO · 2025-06-17 · unverdicted · novelty 6.0

GAF creates 4D dynamic scene models by adding motion to 3D Gaussians, enabling better reconstruction and 7.3% higher success in robotic tasks.

Real-Time Execution of Action Chunking Flow Policies

cs.RO · 2025-06-09 · unverdicted · novelty 6.0

Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.

FLARE: Robot Learning with Implicit World Modeling

cs.RO · 2025-05-21 · unverdicted · novelty 6.0

FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.

GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

cs.RO · 2025-05-06 · unverdicted · novelty 6.0

GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

cs.LG · 2025-04-22 · unverdicted · novelty 6.0

π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.

citing papers explorer

Showing 46 of 46 citing papers.

GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation cs.RO · 2026-05-20 · unverdicted · none · ref 24 · internal anchor
GaussianDream is a feed-forward 3D Gaussian world model that provides structured spatial-temporal supervision for VLA-based robotic manipulation by predicting future Gaussian states during training.
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 46 · internal anchor
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV · 2026-05-08 · conditional · none · ref 50 · 3 links · internal anchor
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities cs.LG · 2026-04-16 · unverdicted · none · ref 12 · internal anchor
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
Action Images: End-to-End Policy Learning via Multiview Video Generation cs.CV · 2026-04-07 · unverdicted · none · ref 69 · internal anchor
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models cs.RO · 2026-02-23 · unverdicted · none · ref 40 · internal anchor
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections cs.CV · 2025-11-16 · conditional · none · ref 49 · internal anchor
BridgeEQA creates a new benchmark and EMVR method for embodied agents to perform question answering on real-world bridge inspections using egocentric images and professional reports.
Beyond Syntax: Action Semantics Learning for App Agents cs.AI · 2025-06-21 · unverdicted · none · ref 20 · internal anchor
Action Semantics Learning trains app agents to align with the semantic effects of actions via a Semantic Estimator module, improving robustness to out-of-distribution scenarios over syntax-matching fine-tuning.
VLAs are Confined yet Capable of Generalizing to Novel Instructions cs.RO · 2025-05-06 · unverdicted · none · ref 44 · internal anchor
Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 90 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Affordance Agent Harness: Verification-Gated Skill Orchestration cs.RO · 2026-05-01 · unverdicted · none · ref 87 · 2 links · internal anchor
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning cs.RO · 2026-04-30 · unverdicted · none · ref 51 · 2 links · internal anchor
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model cs.RO · 2026-04-24 · unverdicted · none · ref 50 · internal anchor
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation cs.RO · 2026-04-20 · unverdicted · none · ref 42 · internal anchor
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
ThermoAct:Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making cs.RO · 2026-03-26 · unverdicted · none · ref 21 · internal anchor
ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.
Learning Native Continuation for Action Chunking Flow Policies cs.RO · 2026-02-13 · unverdicted · none · ref 44 · internal anchor
Legato trains flow-based VLA policies with schedule-shaped action-noise mixtures and randomized conditions to achieve smoother trajectories and ~10% faster task completion than real-time chunking across five real-world manipulation tasks.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation cs.RO · 2026-01-11 · unverdicted · none · ref 150 · internal anchor
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation cs.CV · 2025-12-29 · unverdicted · none · ref 62 · internal anchor
GaussianDWM uses 3D Gaussians with embedded linguistic features, language-guided sampling, and dual-condition generation for unified scene understanding and multi-modal output in driving world models.
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge cs.CV · 2025-07-06 · unverdicted · none · ref 44 · internal anchor
DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.
GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation cs.RO · 2025-06-17 · unverdicted · none · ref 66 · internal anchor
GAF creates 4D dynamic scene models by adding motion to 3D Gaussians, enabling better reconstruction and 7.3% higher success in robotic tasks.
Real-Time Execution of Action Chunking Flow Policies cs.RO · 2025-06-09 · unverdicted · none · ref 70 · internal anchor
Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.
FLARE: Robot Learning with Implicit World Modeling cs.RO · 2025-05-21 · unverdicted · none · ref 39 · internal anchor
FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data cs.RO · 2025-05-06 · unverdicted · none · ref 22 · internal anchor
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization cs.LG · 2025-04-22 · unverdicted · none · ref 90 · internal anchor
π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models cs.CV · 2025-03-27 · unverdicted · none · ref 79 · internal anchor
CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots cs.RO · 2025-03-18 · unverdicted · none · ref 103 · internal anchor
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success cs.RO · 2025-02-27 · accept · none · ref 59 · internal anchor
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control cs.RO · 2025-02-09 · unverdicted · none · ref 19 · internal anchor
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
FAST: Efficient Action Tokenization for Vision-Language-Action Models cs.RO · 2025-01-16 · unverdicted · none · ref 73 · internal anchor
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought cs.CL · 2025-01-13 · unverdicted · none · ref 25 · internal anchor
MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.
OpenVLA: An Open-Source Vision-Language-Action Model cs.RO · 2024-06-13 · unverdicted · none · ref 75 · internal anchor
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset cs.RO · 2024-03-19 · accept · none · ref 67 · internal anchor
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation cs.AI · 2026-05-17 · unverdicted · none · ref 17 · internal anchor
ECG-WM combines ODE physiological priors with latent diffusion models to generate intervention-conditioned ECG trajectories and uses diffusion stochasticity for uncertainty-aware clinical risk assessment.
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization cs.RO · 2026-05-17 · unverdicted · none · ref 57 · internal anchor
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
Nautilus: From One Prompt to Plug-and-Play Robot Learning cs.RO · 2026-05-12 · unverdicted · none · ref 55 · internal anchor
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning cs.RO · 2026-04-20 · unverdicted · none · ref 48 · internal anchor
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
R3D: Revisiting 3D Policy Learning cs.CV · 2026-04-16 · unverdicted · none · ref 52 · internal anchor
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment cs.RO · 2026-04-07 · unverdicted · none · ref 67 · internal anchor
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details cs.CV · 2025-07-03 · unverdicted · none · ref 80 · internal anchor
MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
WorldVLA: Towards Autoregressive Action World Model cs.RO · 2025-06-26 · unverdicted · none · ref 32 · internal anchor
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
What Matters in Building Vision-Language-Action Models for Generalist Robots cs.RO · 2024-12-18 · unverdicted · none · ref 53 · internal anchor
Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.
Towards Robotic Dexterous Hand Intelligence: A Survey cs.RO · 2026-05-13 · unverdicted · none · ref 112 · internal anchor
A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 270 · internal anchor
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines cs.RO · 2026-04-24 · unverdicted · none · ref 31 · internal anchor
A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.
AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models cs.CV · 2026-02-11 · unverdicted · none · ref 6 · internal anchor
AugVLA-3D augments existing VLA models with depth-derived 3D features and action priors to improve generalization and action accuracy in 3D robotic tasks.
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems eess.SY · 2026-04-03 · unverdicted · none · ref 183 · internal anchor
A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

3D-VLA: A 3D Vision-Language-Action Generative World Model

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer