World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

Hai Ci; Kevin Yuchen Ma; Mike Zheng Shou; Xiaokang Liu; Zechen Bai

arxiv: 2602.06508 · v2 · pith:TXJR54DUnew · submitted 2026-02-06 · 💻 cs.RO

Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, Mike Zheng Shou

classification 💻 cs.RO

keywords: world, model, world-vla-loop, policy, video, behavior, learning, near-success

Reinforcement learning (RL) can refine Vision-Language-Action (VLA) policies beyond behavior cloning, but real-world RL remains expensive because it requires extensive rollouts, resets, and supervision, and carries safety risks. Action-conditioned video world models offer a way to train in virtual environments, yet they follow actions imprecisely, particularly on subtle near-success failures. They also lack native reward signals for RL, and rewards computed from inaccurate visual predictions remain unreliable. We introduce World-VLA-Loop, structured around two foundational designs and a higher-level co-evolving paradigm. First, we curate SANS, deliberately mixing successful and near-success trajectories to improve action-outcome alignment. Second, we train a state-aware video world model that jointly predicts future frames and binary rewards from diffusion latents. This couples reward estimation to the generator rather than to a separate module, which in turn benefits visual prediction. Because VLA behavior shifts during RL, a fixed simulator can become misaligned with the updated policy. World-VLA-Loop therefore closes the loop: it uses the refined world model for iterative VLA post-training while feeding rollouts from each improved policy back to augment and fine-tune the world model. Across simulation and real-robot experiments, World-VLA-Loop substantially improves VLA performance while reducing reliance on costly physical interaction.
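
The alternation described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: `ToyPolicy`, `ToyWorldModel`, `collect_policy_rollouts`, and `closed_loop` are hypothetical names, the "world model" predicts only a binary reward from a made-up threshold rule (no video frames or diffusion latents), and the RL step is a crude reward-weighted update chosen for brevity. It only illustrates the order of operations: seed the world model with mixed success and near-success data, post-train the policy inside it, then feed fresh policy rollouts back into the world model.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for a VLA policy: a single scalar parameter."""
    mean: float = 0.3

    def act(self, rng):
        # Sample a scalar "action" clipped to [0, 1].
        return min(1.0, max(0.0, rng.gauss(self.mean, 0.2)))

    def rl_update(self, rollouts, lr=0.1):
        # Assumed update rule: move toward actions seen in rewarded rollouts.
        good = [a for r in rollouts if r["reward"] for a in r["actions"]]
        if good:
            self.mean += lr * (sum(good) / len(good) - self.mean)


@dataclass
class ToyWorldModel:
    """Hypothetical stand-in for a world model with a built-in reward head."""
    dataset: list = field(default_factory=list)

    def fine_tune(self, trajectories):
        # A real model would be trained here; the toy only accumulates data.
        self.dataset.extend(trajectories)

    def rollout(self, policy, rng, horizon=5):
        # Imagined rollout: actions plus a jointly "predicted" binary reward.
        actions = [policy.act(rng) for _ in range(horizon)]
        return {"actions": actions, "reward": int(sum(actions) / horizon > 0.5)}


def collect_policy_rollouts(policy, rng, n, horizon=5):
    # Hypothetical stand-in for executing the current policy to gather new data.
    out = []
    for _ in range(n):
        actions = [policy.act(rng) for _ in range(horizon)]
        out.append({"actions": actions, "reward": int(sum(actions) / horizon > 0.5)})
    return out


def closed_loop(num_iters=3, rollouts_per_iter=20, seed=0):
    rng = random.Random(seed)
    world_model, policy = ToyWorldModel(), ToyPolicy()
    # Seed data: one successful and one near-success trajectory (made up).
    world_model.fine_tune([
        {"actions": [0.6] * 5, "reward": 1},
        {"actions": [0.45] * 5, "reward": 0},
    ])
    for _ in range(num_iters):
        # 1) Post-train the policy on rollouts imagined by the world model.
        imagined = [world_model.rollout(policy, rng) for _ in range(rollouts_per_iter)]
        policy.rl_update(imagined)
        # 2) Feed rollouts from the updated policy back into the world model.
        world_model.fine_tune(collect_policy_rollouts(policy, rng, rollouts_per_iter))
    return world_model, policy
```

Calling `closed_loop()` returns the toy world model and policy after three alternations. The point of the structure is that the world model's data grows with each policy update, so it keeps tracking the behavior of the policy it is used to train.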

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
cs.AI 2026-05 accept novelty 8.0

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
cs.AI 2026-05 unverdicted novelty 8.0

SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
cs.LG 2026-05 unverdicted novelty 7.0

PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
Feedback World Model Enables Precise Guidance of Diffusion Policy
cs.RO 2026-05 unverdicted novelty 6.0

Feedback world model closes the prediction-observation loop at inference time to correct errors and improve diffusion policy performance under distribution shift in robotics.
Reinforcing VLAs in Task-Agnostic World Models
cs.AI 2026-05 unverdicted novelty 6.0

RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
Reinforcing VLAs in Task-Agnostic World Models
cs.AI 2026-05 unverdicted novelty 6.0

RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.
ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
cs.RO 2026-04 unverdicted novelty 6.0

ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.
Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA
cs.RO 2026-04 unverdicted novelty 6.0

SV-VLA uses infrequent heavy VLA planning of action chunks plus a lightweight closed-loop verifier to achieve both efficiency and robustness in dynamic robot control.
Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts
cs.CV 2026-05 unverdicted novelty 4.0

Pre-VLA is a multimodal runtime verifier that predicts safety confidence and advantage scores for action chunks, raising closed-loop success rates on the LIBERO benchmark from 30.79% to 37.62%.
World Model for Robot Learning: A Comprehensive Survey
cs.RO 2026-04 unverdicted novelty 3.0

A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...