hub Canonical reference

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani · 2025 · cs.CV · arXiv 2503.20523

Canonical reference. 100% of citing Pith papers cite this work as background.

48 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 48 citing papers arXiv PDF

abstract

Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving - such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia-2.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 13

citation-polarity summary

background 13

representative citing papers

OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

cs.CV · 2026-06-16 · unverdicted · novelty 7.0

DRIVE-CHOREO uses three LLM agents to create a unified position-aware token sequence co-compressed with multi-view video, achieving SOTA BEV mAP of 21.6 and +2.4 NDS improvement on nuScenes.

Grounding Driving VLA via Inverse Kinematics

cs.CV · 2026-05-20 · conditional · novelty 7.0

By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and nuScenes.

Is Your Driving World Model an All-Around Player?

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.

Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

cs.CV · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.

Latent State Design for World Models under Sufficiency Constraints

cs.AI · 2026-05-03 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

cs.RO · 2026-04-23 · unverdicted · novelty 7.0

VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

ScenarioControl introduces the first vision-language controllable generator for realistic vectorized 3D driving scenarios with temporal consistency across actor views.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

Training Agents Inside of Scalable World Models

cs.AI · 2025-09-29 · conditional · novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

BadDreamer: Transferable Backdoor Attacks against Video World Models for Autonomous Driving

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

Introduces BadDreamer, a backdoor attack that poisons the transition dynamics of video world models so that a trigger causes hallucination of obstacle-free futures, transferring to unsafe action predictions in autonomous driving.

Scaling Self-Play for End-to-End Driving

cs.RO · 2026-06-17 · unverdicted · novelty 6.0

Self-play DAgger training in a batched pixel renderer produces end-to-end driving policies that reach competitive performance on HUGSIM and NAVSIM-v2 after real-world adaptation and improve with more self-play compute.

ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

ActWorld extends navigation-centric world models to support mid-rollout object interactions via chunk-autoregressive generation, action-aware memory routing, and a persistent memory bank, backed by a 100K annotated interaction dataset.

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 6.0

WEAVER is a multi-view world model using flow-matching that jointly satisfies fidelity, consistency, and efficiency for robotic manipulation, yielding 0.87 correlation with real success and policy gains on hardware.

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

cs.CV · 2026-06-09 · unverdicted · novelty 6.0

Next Forcing augments video generation models with auxiliary multi-chunk prediction modules to achieve faster training convergence, higher accuracy at high frame rates, and 2x faster inference on world modeling benchmarks.

Prisma-World: Camera-Controllable Multi-Agent Video World Model

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent cross-view videos from flexible agent counts.

Geometry-Aware Implicit Memory for Video World Models

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

StressDream optimizes initial noise in diffusion video world models using VLM semantic and plausibility objectives to steer generations toward specified high-impact outcomes for improved policy evaluation.

AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond

cs.RO · 2026-05-25 · unverdicted · novelty 6.0

AnyScene is an occupancy-centric framework using a Spatial-Temporal Occupancy Diffusion Transformer and Geometry-Grounded View Expansion to generate controllable driving scenes and videos from BEV layouts.

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.

Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Real2Sim reconstructs editable dynamic driving scenes as temporally continuous Gaussians integrated with a differentiable MPM physics solver for high-fidelity simulation of interactions and collisions.

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HorizonDrive is a new anti-drifting autoregressive training and distillation method that enables minute-scale stable driving video rollouts by making the teacher model rollout-capable via scheduled rollout recovery and teacher rollout DMD.

LA-Pose: Latent Action Pretraining Meets Pose Estimation

cs.CV · 2026-04-30 · unverdicted · novelty 6.0

LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of magnitude less labeled 3D data.

Human Cognition in Machines: A Unified Perspective of World Models

cs.RO · 2026-04-17 · unverdicted · novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.

citing papers explorer

Showing 43 of 43 citing papers after filters.

OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation cs.CV · 2026-06-16 · unverdicted · none · ref 42 · internal anchor
DRIVE-CHOREO uses three LLM agents to create a unified position-aware token sequence co-compressed with multi-view video, achieving SOTA BEV mAP of 21.6 and +2.4 NDS improvement on nuScenes.
Is Your Driving World Model an All-Around Player? cs.CV · 2026-05-11 · unverdicted · none · ref 27 · internal anchor
WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models cs.CV · 2026-05-03 · unverdicted · none · ref 37 · 2 links · internal anchor
M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.
Latent State Design for World Models under Sufficiency Constraints cs.AI · 2026-05-03 · unverdicted · none · ref 54 · internal anchor
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis cs.RO · 2026-04-23 · unverdicted · none · ref 20 · internal anchor
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
MultiWorld: Scalable Multi-Agent Multi-View Video World Models cs.CV · 2026-04-20 · unverdicted · none · ref 33 · internal anchor
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation cs.CV · 2026-04-18 · unverdicted · none · ref 41 · internal anchor
ScenarioControl introduces the first vision-language controllable generator for realistic vectorized 3D driving scenarios with temporal consistency across actor views.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 81 · internal anchor
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
BadDreamer: Transferable Backdoor Attacks against Video World Models for Autonomous Driving cs.CV · 2026-06-19 · unverdicted · none · ref 47 · internal anchor
Introduces BadDreamer, a backdoor attack that poisons the transition dynamics of video world models so that a trigger causes hallucination of obstacle-free futures, transferring to unsafe action predictions in autonomous driving.
Scaling Self-Play for End-to-End Driving cs.RO · 2026-06-17 · unverdicted · none · ref 67 · internal anchor
Self-play DAgger training in a batched pixel renderer produces end-to-end driving policies that reach competitive performance on HUGSIM and NAVSIM-v2 after real-world adaptation and improve with more self-play compute.
ActWorld: From Explorable to Interactive World Model via Action-Aware Memory cs.CV · 2026-06-16 · unverdicted · none · ref 24 · internal anchor
ActWorld extends navigation-centric world models to support mid-rollout object interactions via chunk-autoregressive generation, action-aware memory routing, and a persistent memory bank, backed by a 100K annotated interaction dataset.
WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation cs.RO · 2026-06-11 · unverdicted · none · ref 36 · internal anchor
WEAVER is a multi-view world model using flow-matching that jointly satisfies fidelity, consistency, and efficiency for robotic manipulation, yielding 0.87 correlation with real success and policy gains on hardware.
Next Forcing: Causal World Modeling with Multi-Chunk Prediction cs.CV · 2026-06-09 · unverdicted · none · ref 50 · internal anchor
Next Forcing augments video generation models with auxiliary multi-chunk prediction modules to achieve faster training convergence, higher accuracy at high frame rates, and 2x faster inference on world modeling benchmarks.
Prisma-World: Camera-Controllable Multi-Agent Video World Model cs.CV · 2026-06-08 · unverdicted · none · ref 56 · internal anchor
Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent cross-view videos from flexible agent counts.
Geometry-Aware Implicit Memory for Video World Models cs.CV · 2026-06-01 · unverdicted · none · ref 40 · internal anchor
GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.
StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement cs.CV · 2026-05-29 · unverdicted · none · ref 78 · internal anchor
StressDream optimizes initial noise in diffusion video world models using VLM semantic and plausibility objectives to steer generations toward specified high-impact outcomes for improved policy evaluation.
AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond cs.RO · 2026-05-25 · unverdicted · none · ref 37 · internal anchor
AnyScene is an occupancy-centric framework using a Spatial-Temporal Occupancy Diffusion Transformer and Geometry-Grounded View Expansion to generate controllable driving scenes and videos from BEV layouts.
PanoWorld: Geometry-Consistent Panoramic Video World Modeling cs.CV · 2026-05-14 · unverdicted · none · ref 19 · internal anchor
PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.
Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes cs.CV · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
Real2Sim reconstructs editable dynamic driving scenes as temporally continuous Gaussians integrated with a differentiable MPM physics solver for high-fidelity simulation of interactions and collisions.
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation cs.CV · 2026-05-12 · unverdicted · none · ref 18 · 2 links · internal anchor
HorizonDrive is a new anti-drifting autoregressive training and distillation method that enables minute-scale stable driving video rollouts by making the teacher model rollout-capable via scheduled rollout recovery and teacher rollout DMD.
LA-Pose: Latent Action Pretraining Meets Pose Estimation cs.CV · 2026-04-30 · unverdicted · none · ref 24 · internal anchor
LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of magnitude less labeled 3D data.
Human Cognition in Machines: A Unified Perspective of World Models cs.RO · 2026-04-17 · unverdicted · none · ref 143 · internal anchor
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction cs.CV · 2026-04-13 · unverdicted · none · ref 60 · internal anchor
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving cs.CV · 2026-04-09 · unverdicted · none · ref 39 · internal anchor
LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms eess.IV · 2026-03-30 · unverdicted · none · ref 172 · internal anchor
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
DriveLaW:Unifying Planning and Video Generation in a Latent Driving World cs.CV · 2025-12-29 · unverdicted · none · ref 60 · internal anchor
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
AstraNav-World: World Model for Foresight Control and Consistency cs.CV · 2025-12-25 · unverdicted · none · ref 20 · internal anchor
AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied navigation.
Generative View Stitching cs.CV · 2025-10-28 · unverdicted · none · ref 7 · internal anchor
Generative View Stitching samples full video sequences in parallel using off-the-shelf Diffusion Forcing models plus Omni Guidance to produce stable, collision-free, loop-closing camera-guided videos.
HERO: Hierarchical Extrapolation and Refresh for Efficient World Models cs.CV · 2025-08-25 · unverdicted · none · ref 21 · internal anchor
HERO accelerates world model inference 1.73x via hierarchical patch-wise refresh in shallow layers and linear extrapolation in deeper layers with minimal quality loss.
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation cs.RO · 2025-08-07 · unverdicted · none · ref 26 · internal anchor
Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI · 2025-06-11 · unverdicted · none · ref 46 · internal anchor
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.
RetailSMV: Exocentric vs. Egocentric Adaptation of Foundation Video World Models in Retail cs.CV · 2026-07-01 · unverdicted · none · ref 22 · internal anchor
Exocentric-only LoRA adaptation of Cosmos3-Nano on a new synchronized retail video dataset matches or exceeds combined ego+exo training on most held-out metrics.
ReWorld: Learning Better Representations for World Action Models cs.CV · 2026-06-25 · unverdicted · none · ref 22 · internal anchor
ReWorld applies future-predictive, cross-modal, and hard-negative supervision directly to intermediate representations in Video and Action DiTs for WAMs, reporting 23.9% FVD improvement and PDMS rise from 89.1 to 90.4 on nuScenes and NAVSIM.
FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model cs.CV · 2026-06-18 · unverdicted · none · ref 74 · internal anchor
FrozenDrive enables zero-shot text-guided generation of consistent multi-view driving scenes via a parameter-free frozen diffusion backbone with spatio-temporal attention, improving autonomous driving models on adverse conditions via data augmentation.
Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving cs.CV · 2026-05-18 · unverdicted · none · ref 7 · 2 links · internal anchor
A unified system integrating sparse-query 3D Gaussian reconstruction with multi-stage causal video generation for autonomous driving world models.
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation cs.CV · 2026-04-20 · unverdicted · none · ref 3 · internal anchor
Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples sparse-view multiview generation with 3D Gaussian lifting.
Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic cs.AI · 2026-04-14 · unverdicted · none · ref 19 · internal anchor
This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.
Business World Model cs.AI · 2026-06-08 · unverdicted · none · ref 2 · 2 links · internal anchor
This paper introduces the Business World Model, a conceptual architecture that encodes business states, dynamics, and actions using semantic representations to support autonomous planning.
NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation cs.CV · 2026-06-02 · unverdicted · none · ref 44 · internal anchor
OmniDreams is a real-time generative world model mid- and post-trained from the Cosmos diffusion model on 21k hours of driving data to autoregressively generate action-conditioned videos for closed-loop AV simulation.
Advancing Open-source World Models cs.CV · 2026-01-28 · unverdicted · none · ref 58 · internal anchor
LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.
World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications cs.LG · 2026-05-28 · unverdicted · none · ref 182 · internal anchor
The paper delivers a multi-axis taxonomy for world models that maps architectures, training families, reasoning strategies, and domains from early cognitive foundations through systems such as Dreamer, MuZero, and Sora while noting evaluation gaps.
A Tutorial on World Models and Physical AI cs.AI · 2026-06-11 · unverdicted · none · ref 35 · internal anchor
A tutorial that unifies explicit and implicit world models through shared predictive structure for applications in physical AI such as robotics.
Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends cs.CV · 2026-05-31 · unverdicted · none · ref 111 · internal anchor
This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer