pith. sign in

hub Mixed citations

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Mixed citation behavior. Most common role is background (62%).

27 Pith papers citing it
Background 62% of classified citations
abstract

Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent works have begun to explore the incorporation of latent actions, abstract representations of motion between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Vision-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. We demonstrate that villa-X can generate latent action plans in a zero-shot fashion, even for unseen embodiments and open-vocabulary symbolic understanding. This capability enables villa-X to achieve superior performance across diverse simulation tasks in SIMPLER and on two real-world robotic setups involving both gripper and dexterous hand manipulation. These results establish villa-X as a principled and scalable paradigm for learning generalizable robot manipulation policies. We believe it provides a strong foundation for future research.

hub tools

citation-role summary

background 10 other 2 baseline 1

citation-polarity summary

years

2026 24 2025 3

verdicts

UNVERDICTED 27

representative citing papers

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.

CUBic: Coordinated Unified Bimanual Perception and Control Framework

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordination accuracy and task success on the RoboTwin benchmark.

Reinforcing VLAs in Task-Agnostic World Models

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.

Unified Noise Steering for Efficient Human-Guided VLA Adaptation

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

GazeVLA: Learning Human Intention for Robotic Manipulation

cs.RO · 2026-04-24 · unverdicted · novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

Co-Evolving Latent Action World Models

cs.LG · 2025-10-30 · unverdicted · novelty 6.0

CoLA-World jointly trains latent action models and world models with a warm-up phase to achieve co-evolution, matching or exceeding prior two-stage methods in video simulation quality and visual planning performance.

citing papers explorer

Showing 27 of 27 citing papers.