super hub Canonical reference

WorldVLA: Towards Autoregressive Action World Model

Chaohui Yu, Hangjie Yuan, Jiayan Guo, Jun Cen, Siteng Huang, Yuming Jiang · 2025 · cs.RO · arXiv 2506.21539

Canonical reference. 83% of citing Pith papers cite this work as background.

115 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 115 citing papers more from Chaohui Yu arXiv PDF

abstract

We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation. Our WorldVLA intergrates Vision-Language-Action (VLA) model and world model in one single framework. The world model predicts future images by leveraging both action and image understanding, with the purpose of learning the underlying physics of the environment to improve action generation. Meanwhile, the action model generates the subsequent actions based on image observations, aiding in visual understanding and in turn helps visual generation of the world model. We demonstrate that WorldVLA outperforms standalone action and world models, highlighting the mutual enhancement between the world model and the action model. In addition, we find that the performance of the action model deteriorates when generating sequences of actions in an autoregressive manner. This phenomenon can be attributed to the model's limited generalization capability for action prediction, leading to the propagation of errors from earlier actions to subsequent ones. To address this issue, we propose an attention mask strategy that selectively masks prior actions during the generation of the current action, which shows significant performance improvement in the action chunk generation task.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 25 baseline 4 method 1

citation-polarity summary

background 25 baseline 4 use method 1

claims ledger

abstract We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation. Our WorldVLA intergrates Vision-Language-Action (VLA) model and world model in one single framework. The world model predicts future images by leveraging both action and image understanding, with the purpose of learning the underlying physics of the environment to improve action generation. Meanwhile, the action model generates the subsequent actions based on image observations, aiding in visual understanding and in turn helps visual generation of the world model. We demonstrat

authors

Chaohui Yu Hangjie Yuan Jiayan Guo Jun Cen Siteng Huang Yuming Jiang

co-cited works

representative citing papers

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 7.0

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

cs.RO · 2026-06-04 · unverdicted · novelty 7.0

PiL-World introduces a chunk-wise world model for closed-loop VLA policy evaluation that reduces the gap between simulated and real success rates from 63.2% to 12.0% on three dual-arm manipulation tasks by conditioning on action-derived visual control and latent histories while training on both succ

Can VLA Models Learn from Real-World Data Continually without Forgetting?

cs.RO · 2026-05-26 · unverdicted · novelty 7.0

VLA models exhibit catastrophic forgetting on a new real-world dataset of four sequential manipulation tasks, with experience replay implementation factors evaluated for mitigation.

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.

Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

cs.RO · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · conditional · novelty 7.0 · 3 refs

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

Learning Visual Feature-Based World Models via Residual Latent Action

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU

cs.OS · 2026-05-02 · unverdicted · novelty 7.0

VUDA enables spatial sharing between CUDA and Vulkan on GPUs via channel redirection and page-table grafting, achieving up to 85% higher throughput than temporal baselines in embodied AI tasks.

Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation

cs.AI · 2026-05-01 · unverdicted · novelty 7.0

A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.

Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System

cs.RO · 2026-04-27 · conditional · novelty 7.0

Libra-VLA introduces a coarse-to-fine dual-system architecture for VLA models that decouples discrete macro-directional planning from continuous micro-pose refinement, with performance peaking at balanced learning difficulty.

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

cs.RO · 2026-04-23 · unverdicted · novelty 7.0

VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.

Learning Vision-Language-Action World Models for Autonomous Driving

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

Telecom World Models: Unifying Digital Twins, Foundation Models, and Predictive Planning for 6G

cs.RO · 2026-04-08 · unverdicted · novelty 7.0

Telecom World Models introduce a three-layer architecture for learned, action-conditioned, uncertainty-aware modeling of 6G network dynamics, combining digital twins and foundation models, with a network slicing proof-of-concept showing improved KPI prediction over baselines.

DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

cs.RO · 2026-03-27 · conditional · novelty 7.0

DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

cs.RO · 2026-02-18 · unverdicted · novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

VLAFlow shows that combining language-supervised co-training with future latent alignment produces the most stable transfer performance for vision-language-action models trained on mixed robot data.

citing papers explorer

Showing 50 of 106 citing papers after filters.

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots cs.RO · 2026-07-02 · unverdicted · none · ref 29 · internal anchor
Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.
Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation cs.RO · 2026-06-29 · unverdicted · none · ref 5 · internal anchor
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models cs.RO · 2026-06-22 · unverdicted · none · ref 10 · internal anchor
LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi
ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 8 · internal anchor
ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.
PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space cs.RO · 2026-06-16 · unverdicted · none · ref 35 · internal anchor
PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.
X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining cs.CV · 2026-06-07 · unverdicted · none · ref 10 · internal anchor
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation cs.RO · 2026-06-04 · unverdicted · none · ref 22 · internal anchor
PiL-World introduces a chunk-wise world model for closed-loop VLA policy evaluation that reduces the gap between simulated and real success rates from 63.2% to 12.0% on three dual-arm manipulation tasks by conditioning on action-derived visual control and latent histories while training on both succ
Can VLA Models Learn from Real-World Data Continually without Forgetting? cs.RO · 2026-05-26 · unverdicted · none · ref 16 · internal anchor
VLA models exhibit catastrophic forgetting on a new real-world dataset of four sequential manipulation tasks, with experience replay implementation factors evaluated for mitigation.
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning cs.RO · 2026-05-13 · unverdicted · none · ref 7 · internal anchor
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models cs.AI · 2026-05-12 · unverdicted · none · ref 8 · internal anchor
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models cs.RO · 2026-05-12 · unverdicted · none · ref 20 · 2 links · internal anchor
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
Learning Visual Feature-Based World Models via Residual Latent Action cs.CV · 2026-05-08 · unverdicted · none · ref 7 · internal anchor
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU cs.OS · 2026-05-02 · unverdicted · none · ref 7 · internal anchor
VUDA enables spatial sharing between CUDA and Vulkan on GPUs via channel redirection and page-table grafting, achieving up to 85% higher throughput than temporal baselines in embodied AI tasks.
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation cs.AI · 2026-05-01 · unverdicted · none · ref 4 · internal anchor
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis cs.RO · 2026-04-23 · unverdicted · none · ref 19 · internal anchor
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
WorldMark: A Unified Benchmark Suite for Interactive Video World Models cs.CV · 2026-04-23 · unverdicted · none · ref 4 · internal anchor
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
Learning Vision-Language-Action World Models for Autonomous Driving cs.CV · 2026-04-10 · unverdicted · none · ref 11 · internal anchor
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
Telecom World Models: Unifying Digital Twins, Foundation Models, and Predictive Planning for 6G cs.RO · 2026-04-08 · unverdicted · none · ref 27 · internal anchor
Telecom World Models introduce a three-layer architecture for learned, action-conditioned, uncertainty-aware modeling of 6G network dynamics, combining digital twins and foundation models, with a network slicing proof-of-concept showing improved KPI prediction over baselines.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation cs.RO · 2026-02-18 · unverdicted · none · ref 7 · internal anchor
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment cs.CV · 2026-07-02 · unverdicted · none · ref 11 · internal anchor
VLAFlow shows that combining language-supervised co-training with future latent alignment produces the most stable transfer performance for vision-language-action models trained on mixed robot data.
Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation cs.RO · 2026-07-01 · unverdicted · none · ref 8 · internal anchor
Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.
ABot-M0.5: Unified Mobility-and-Manipulation World Action Model cs.CV · 2026-07-01 · unverdicted · none · ref 8 · internal anchor
ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.
SA-VLA: State-aware tokenizer for improving Vision-Language-Action Models' performance cs.RO · 2026-06-29 · unverdicted · none · ref 24 · internal anchor
SA-VLA adds state conditioning to VQ-based action tokenization in VLA policies, expanding each discrete token's effective support to state-dependent actions and raising average success rates from 0.29 to 0.56 on 12 sim tasks and 0.15 to 0.33 on 3 real tasks.
ActWorld: From Explorable to Interactive World Model via Action-Aware Memory cs.CV · 2026-06-16 · unverdicted · none · ref 2 · internal anchor
ActWorld extends navigation-centric world models to support mid-rollout object interactions via chunk-autoregressive generation, action-aware memory routing, and a persistent memory bank, backed by a 100K annotated interaction dataset.
Geometric Action Model for Robot Policy Learning cs.RO · 2026-06-15 · unverdicted · none · ref 35 · internal anchor
GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.
Unified Motion-Action Modeling for Heterogeneous Robot Learning cs.RO · 2026-06-15 · unverdicted · none · ref 42 · internal anchor
UMA treats object motion and robot actions as co-evolving variables under a masked generative objective with hindsight relabeling and contrastive disentanglement to support multi-task pretraining and deployment across heterogeneous robot data.
APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies cs.RO · 2026-06-10 · unverdicted · none · ref 42 · internal anchor
APT pretrains the action expert as a vision-action prior on frozen VLM features then adds language through gated fusion to improve OOD instruction generalization in continuous-action VLA policies.
Next Forcing: Causal World Modeling with Multi-Chunk Prediction cs.CV · 2026-06-09 · unverdicted · none · ref 11 · internal anchor
Next Forcing augments video generation models with auxiliary multi-chunk prediction modules to achieve faster training convergence, higher accuracy at high frame rates, and 2x faster inference on world modeling benchmarks.
Prisma-World: Camera-Controllable Multi-Agent Video World Model cs.CV · 2026-06-08 · unverdicted · none · ref 13 · internal anchor
Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent cross-view videos from flexible agent counts.
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation cs.RO · 2026-06-08 · unverdicted · none · ref 17 · internal anchor
MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.
Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation cs.RO · 2026-06-07 · unverdicted · none · ref 13 · internal anchor
Dream-Tac unifies visual and tactile signals in a world action model using contact-gated fusion and attention bias, reporting 31.7% average action accuracy gains on six manipulation tasks.
FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation cs.RO · 2026-06-07 · unverdicted · none · ref 32 · internal anchor
FAWAM integrates force signals into perception, prediction, and closed-loop correction, raising success rates 36% over vision baselines in contact-rich manipulation tasks.
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation cs.RO · 2026-06-07 · unverdicted · none · ref 40 · internal anchor
GEAR-VLA learns geometry-aware action representations via coarse-to-fine pretraining, gradient-decoupled DiT action expert, semantic-aligned 3D integration, and embodiment canonicalization, reporting SOTA results on LIBERO benchmarks and over 80% success on unseen embodiments and 212 unseen objects.
MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action cs.RO · 2026-06-04 · unverdicted · none · ref 20 · internal anchor
MPCoT improves long-horizon VLA performance on LIBERO and CALVIN by initializing M latent hypotheses, refining them over K steps, and aggregating via a reward-trained path scorer while preserving the original 8-step action interface and generating zero reasoning tokens.
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding cs.RO · 2026-06-04 · unverdicted · none · ref 7 · internal anchor
AffordanceVLA proposes a VLA model with affordance-aware modules (Which2Act, Where2Act, How2Act) in a Mixture-of-Transformer trained in three stages to improve robotic manipulation.
WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation cs.AI · 2026-06-04 · unverdicted · none · ref 3 · internal anchor
WorldFly integrates a world model into a VLA framework via dual-branch coupled flow matching to jointly generate future videos and actions, outperforming baselines on an urban canyon traversal benchmark especially in unseen environments.
Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation cs.RO · 2026-06-02 · unverdicted · none · ref 9 · internal anchor
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation cs.RO · 2026-06-01 · unverdicted · none · ref 13 · internal anchor
AHEAD augments frozen VLAs with a 4.9M-parameter latent world model that forecasts future visual features using optical-flow motion cues, achieving 79-97% success on dynamic simulation tasks and high real-robot success rates where baselines score near zero.
ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation cs.RO · 2026-05-28 · unverdicted · none · ref 8 · internal anchor
ELAN4D introduces plug-and-play 4D keypoint track supervision from forward kinematics to enhance VLA policy generalization in robotic manipulation tasks.
3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding cs.RO · 2026-05-28 · unverdicted · none · ref 5 · internal anchor
3DVLA is a plug-and-play framework that enhances pretrained VLAs with pervasive 3D feature encoding using multi-view consistency and Spatially-Conditioned Geometry Aggregation, an instance estimation module, and a masked self-supervised 3D branch, yielding gains on LIBERO-Plus and RoboTwin 2.0.
Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance cs.RO · 2026-05-22 · unverdicted · none · ref 5 · internal anchor
Afford-VLA internalizes task-conditioned affordance as an explicit visual planning interface within VLA models via learnable <AFF> tokens, achieving SOTA on LIBERO and SimplerEnv benchmarks.
GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation cs.RO · 2026-05-20 · unverdicted · none · ref 37 · internal anchor
GaussianDream is a feed-forward 3D Gaussian world model plug-in that conditions VLA policies on learned 3D spatial and future evolution representations for improved robotic manipulation performance.
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization cs.RO · 2026-05-12 · unverdicted · none · ref 12 · 2 links · internal anchor
GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.
The DAWN of World-Action Interactive Models cs.CV · 2026-05-12 · unverdicted · none · ref 5 · internal anchor
DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 9 · internal anchor
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 9 · 2 links · internal anchor
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation cs.RO · 2026-05-08 · unverdicted · none · ref 11 · internal anchor
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 9 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning cs.RO · 2026-04-30 · unverdicted · none · ref 19 · 2 links · internal anchor
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

WorldVLA: Towards Autoregressive Action World Model

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer