BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents

Fan Wang; Haifeng Wang; Hua Wu; Jizhou Huang; Kaixin Xiong; Shi Gong; Xiaofan Li; Xiaoqing Ye; Xiao Tan; Yumeng Zhang

arxiv: 2407.05679 · v3 · pith:AZBANBTGnew · submitted 2024-07-08 · 💻 cs.CV · cs.AI

Yumeng Zhang, Shi Gong, Kaixin Xiong, Xiaoqing Ye, Xiaofan Li, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang

keywords: latent, autonomous, bevworld, driving, future, model, world, data

Abstract

World models have attracted increasing attention in autonomous driving for their ability to forecast potential future scenarios. In this paper, we propose BEVWorld, a novel framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for holistic environment modeling. The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model. The multi-modal tokenizer first encodes heterogeneous sensory data, and its decoder reconstructs the latent BEV tokens into LiDAR and surround-view image observations via ray-casting rendering in a self-supervised manner. This enables joint modeling and bidirectional encoding-decoding of panoramic imagery and point cloud data within a shared spatial representation. On top of this, the latent BEV sequence diffusion model performs temporally consistent forecasting of future scenes, conditioned on high-level action tokens, enabling scene-level reasoning over time. Extensive experiments demonstrate the effectiveness of BEVWorld on autonomous driving benchmarks, showcasing its capability in realistic future scene generation and its benefits for downstream tasks such as perception and motion prediction.
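
The ray-casting decoding step mentioned in the abstract can be illustrated with a toy example. The sketch below is an assumption for illustration only, not the paper's implementation: it replaces the learned BEV latent with a plain 2D density grid and uses the standard volume-rendering weights (transmittance times per-sample opacity) to estimate the range at which a single LiDAR-like ray hits an obstacle. The function name `render_ray_depth` and all parameters are hypothetical.

```python
import numpy as np

def render_ray_depth(density, origin, direction, t_max=10.0, n_samples=64):
    """Expected hit distance of one ray through a 2D density grid.

    Toy stand-in for a BEV latent: `density` is a plain (H, W) array and
    one grid cell spans one unit of distance. Hypothetical helper.
    """
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)

    # Midpoints of n_samples equal-length segments covering [0, t_max].
    delta = t_max / n_samples
    t = (np.arange(n_samples) + 0.5) * delta
    pts = np.asarray(origin, dtype=float) + t[:, None] * direction

    # Nearest-cell lookup; samples outside the grid keep zero density.
    ij = np.floor(pts).astype(int)
    h, w = density.shape
    inside = (ij[:, 0] >= 0) & (ij[:, 0] < h) & (ij[:, 1] >= 0) & (ij[:, 1] < w)
    sigma = np.zeros(n_samples)
    sigma[inside] = density[ij[inside, 0], ij[inside, 1]]

    # Volume-rendering weights: w_i = T_i * alpha_i, where T_i is the
    # probability that the ray reaches sample i without being absorbed.
    alpha = 1.0 - np.exp(-sigma * delta)
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    weights = trans * alpha

    depth = float(np.sum(weights * t))
    return depth, weights

# Usage: a dense "wall" occupying grid row 8, ray cast along the row axis.
grid = np.zeros((16, 16))
grid[8, :] = 50.0
depth, weights = render_ray_depth(grid, origin=(0.5, 8.5), direction=(1.0, 0.0))
print(f"estimated range: {depth:.2f}")
```

In a learned model the density (and any colour features) would be predicted from the BEV latent rather than stored directly, and the same weights could be reused to composite image features along camera rays. That shared rendering path is what makes a single latent decodable into both modalities in this kind of design.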

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations
cs.CV 2025-06 unverdicted novelty 7.0

BEVCALIB performs LiDAR-camera calibration from raw data by fusing camera and LiDAR bird's-eye view features with a novel feature selector and reports state-of-the-art accuracy on KITTI and NuScenes.

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training
cs.CV 2026-06 unverdicted novelty 6.0

HilDA pre-trains LiDAR backbones via multi-layer and global distillation from vision models plus temporal occupancy diffusion, yielding SOTA results on detection, flow, and occupancy tasks.

DriveFuture: Future-Aware Latent World Models for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
cs.CV 2026-04 unverdicted novelty 6.0

HERMES++ unifies 3D scene understanding and future geometry prediction in driving scenes via BEV representations, LLM-enhanced queries, a temporal link, and joint geometric optimization.

Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

ReSim: Reliable World Simulation for Autonomous Driving
cs.CV 2025-06 unverdicted novelty 6.0

ReSim is a controllable video world model trained on heterogeneous real and simulated driving data that achieves higher fidelity and controllability for both expert and non-expert actions, plus a Video2Reward module f...

PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models
cs.AI 2026-06 unverdicted novelty 5.0

PLAN-S decodes a style-conditioned four-channel semantic cost map from latent representations to bridge world models and planners in autonomous driving, reporting 0.55 m average L2 and 42% collision reduction on nuSce...

DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment
cs.RO 2025-04 unverdicted novelty 5.0

DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
cs.CV 2026-04 unverdicted novelty 4.0

OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.