VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
Canonical reference
Omninwm: Omniscient driving navigation world models
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. However, existing methods are typically restricted to fragmented modality modeling, short-horizon drift, and imprecise action control, while lacking intrinsic mechanisms for policy evaluation. In this paper, we introduce OmniNWM, an Omniscient panoramic Navigation World Model that addresses all three dimensions within a consistent probabilistic framework. For State, OmniNWM generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy, ensuring pixel-level alignment across modalities with joint distribution modeling. To mitigate autoregressive exposure bias, we propose a structured panoramic forcing strategy to stabilize long-horizon generation via stochastic manifold thickening. For Action, we introduce canonical geometric action encoding with normalized panoramic Pl\"ucker ray-maps. This representation decouples motion dynamics from sensor intrinsics, enabling precise, zero-shot trajectory control across heterogeneous datasets and camera configurations. For Reward, we derive intrinsic occupancy-grounded dense rewards directly from generated 3D volumes, establishing a reliable closed-loop simulation cycle for evaluating diverse planning agents. Extensive experiments demonstrate that OmniNWM achieves SOTA performance in generation fidelity and control precision, with remarkable zero-shot robustness to novel scenes on NuPlan and in-house datasets with distinct camera rigs. Project page is available at https://arlo0o.github.io/OmniNWM/.
citation-role summary
citation-polarity summary
fields
cs.CV 10roles
background 5polarities
background 5representative citing papers
COVScene is a pose-free framework that lifts semantic Gaussians into a volumetric occupancy field during training to jointly support novel view synthesis, open-vocabulary segmentation, and semantic occupancy prediction.
PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
ReWorld applies future-predictive, cross-modal, and hard-negative supervision directly to intermediate representations in Video and Action DiTs for WAMs, reporting 23.9% FVD improvement and PDMS rise from 89.1 to 90.4 on nuScenes and NAVSIM.
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.
citing papers explorer
-
DriveLaW:Unifying Planning and Video Generation in a Latent Driving World
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.