arXiv: 2503.20523 · v1 · submitted 2025-03-26 · cs.CV · cs.AI · cs.RO

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, Gianluca Corrado

Pith reviewed 2026-05-15 13:44 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.RO

keywords generative world model · autonomous driving · video generation · latent diffusion · multi-camera consistency · controllable simulation · synthetic data · spatiotemporal consistency

The pith

GAIA-2 generates high-resolution multi-camera driving videos from structured inputs like vehicle dynamics, agent positions, and road semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GAIA-2 as a latent diffusion world model built to meet autonomous driving needs that current generative approaches miss. It produces spatiotemporally consistent videos across multiple cameras by conditioning on ego-vehicle dynamics, agent configurations, environmental factors, and road semantics, while also accepting external latent embeddings. The work targets scalable creation of both everyday and rare driving scenes in varied locations such as the UK, US, and Germany. If the outputs prove usable, they could supply synthetic training data that reduces dependence on costly real-world collection. A sympathetic reader would focus on whether this unified conditioning framework actually delivers the control and consistency required for downstream autonomous systems.

Core claim

GAIA-2 is a latent diffusion world model that unifies controllable multi-view video generation within one framework. It conditions generation on structured inputs consisting of ego-vehicle dynamics, agent configurations, environmental factors, and road semantics, and further integrates external latent embeddings to support semantically grounded scene synthesis. The model produces high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments.

What carries the argument

Latent diffusion world model that combines structured conditioning signals with external latent embeddings for flexible, multi-view scene synthesis.

If this is right

Enables scalable simulation of both common and rare driving scenarios without additional real-world data collection.
Supports multi-agent interactions and multi-camera consistency in a single generative pass.
Allows flexible conditioning that mixes structured inputs with external model embeddings for semantically controlled outputs.
Provides synthetic data usable across geographically diverse environments including the UK, US, and Germany.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Could lower the barrier to testing dangerous or low-probability edge cases by generating them on demand rather than waiting for real occurrences.
Opens the possibility of closed-loop simulation where generated scenes feed directly into planning models that then influence the next generation step.
Might accelerate iteration on autonomous systems by allowing rapid creation of targeted datasets focused on specific failure modes.
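
The closed-loop idea in the list above can be sketched as a rollout in which a planner reads each generated frame and emits the conditioning for the next one. This is a toy illustration only: `Conditioning`, `Frame`, `toy_world_model`, and `toy_planner` are hypothetical stand-ins invented here, not GAIA-2 interfaces, and the "world model" is a one-line kinematic update rather than a generative model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Conditioning:
    """Hypothetical structured input for one generation step."""
    ego_speed_mps: float
    ego_steering_rad: float


@dataclass
class Frame:
    """Stand-in for a generated multi-camera frame (a real one would hold pixels)."""
    step: int
    obstacle_gap_m: float


def toy_world_model(prev: Frame, cond: Conditioning, dt: float = 0.1) -> Frame:
    """Placeholder generator: the gap to a stationary obstacle shrinks with ego speed."""
    return Frame(prev.step + 1, prev.obstacle_gap_m - cond.ego_speed_mps * dt)


def toy_planner(frame: Frame, cond: Conditioning) -> Conditioning:
    """Placeholder planner: brake by 2 m/s per step once the gap drops below 10 m."""
    if frame.obstacle_gap_m < 10.0:
        return Conditioning(max(0.0, cond.ego_speed_mps - 2.0), cond.ego_steering_rad)
    return cond


def closed_loop_rollout(
    world_model: Callable[[Frame, Conditioning], Frame],
    planner: Callable[[Frame, Conditioning], Conditioning],
    first_frame: Frame,
    first_cond: Conditioning,
    steps: int,
) -> List[Frame]:
    """Alternate generation and planning: each plan conditions the next generated frame."""
    frames = [first_frame]
    cond = first_cond
    for _ in range(steps):
        frame = world_model(frames[-1], cond)  # generate the next frame from the current plan
        cond = planner(frame, cond)            # the planner reacts to what was generated
        frames.append(frame)
    return frames


rollout = closed_loop_rollout(
    toy_world_model, toy_planner, Frame(0, 20.0), Conditioning(10.0, 0.0), steps=30
)
```

The point of the structure is that errors can compound: whatever the generator gets wrong is fed back through the planner into the next conditioning signal, which is one reason closed-loop use is a stricter test than open-loop video quality.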

Load-bearing premise

The generated videos must be realistic, consistent, and free of artifacts so that models trained on them transfer effectively to real vehicles without new biases or failures.

What would settle it

Train an autonomous driving perception or planning model exclusively on GAIA-2 videos and measure its performance on held-out real-world driving data; performance that matches or exceeds a real-data baseline would support the claim.
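
One hedged way to make "matches or exceeds" concrete is a bootstrap confidence interval on the difference in mean real-world test score between models trained on synthetic data and models trained on real data. The sketch below assumes one scalar score per independent training run (higher is better); the function name, the percentile bootstrap, and the example scores are illustrative choices made here, not part of the paper or of this review.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_mean_diff_ci(
    synthetic_scores: Sequence[float],
    real_scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap CI for mean(synthetic) - mean(real), both measured on real test data."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original group sizes.
        s = [rng.choice(synthetic_scores) for _ in synthetic_scores]
        r = [rng.choice(real_scores) for _ in real_scores]
        diffs.append(mean(s) - mean(r))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder numbers for illustration only; they are not results from any experiment.
example_synthetic = [0.70, 0.72, 0.71, 0.73, 0.69]
example_real = [0.60, 0.61, 0.59, 0.62, 0.60]
lower, upper = bootstrap_mean_diff_ci(example_synthetic, example_real)
```

Under this reading, an interval whose lower bound sits at or above zero (or above a pre-registered tolerance below zero) would count as support for the claim; an interval entirely below the tolerance would count against it.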

Original abstract

Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving - such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia-2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GAIA-2 adds structured multi-view conditioning to latent diffusion for driving video, but the abstract gives no numbers to show the outputs are clean enough for real AV training.

The paper introduces GAIA-2 as a latent diffusion model that takes ego dynamics, agent layouts, road semantics, and environmental factors as conditioning and produces consistent multi-camera video across different geographies. It also folds in external embeddings from a separate driving model. That combination is the main technical step: it aims for controllable generation while keeping spatial and temporal coherence across views, which matters for simulating multi-agent scenes.

The work is aimed squarely at a practical gap in autonomous driving: real data collection is expensive and rare events are hard to capture. On that front the architecture description looks reasonable and the choice of inputs matches what AV teams actually need to control. The geographic spread (UK, US, Germany) is a small plus, since it suggests the model is not overfit to one dataset.

The soft spot is the lack of quantitative evidence. Neither the abstract nor the stress-test note reports FID, FVD, cross-view alignment scores, temporal coherence numbers, or an ablation of structured versus latent conditioning. There are also no closed-loop transfer results showing that perception or planning models trained on the generated videos hold up on real roads without new failure modes. Without those, the claim that this advances generative world models as a core tool stays qualitative. The videos on the project page may look good, but that is not enough to judge distribution shift or artifact impact.

This is the kind of paper that belongs in a reading group for people building driving simulators or data pipelines. A reader working on generative simulation would pick up the conditioning scheme and see how it differs from prior diffusion work in this domain. It deserves peer review because the problem is important and the framing is honest about the target use case, even though the current version is light on validation. I would ask for metrics and at least one transfer experiment before accepting.

Referee Report

2 major / 2 minor

Summary. The paper introduces GAIA-2, a latent diffusion world model for autonomous driving that unifies controllable multi-view video generation conditioned on ego-vehicle dynamics, agent configurations, environmental factors, road semantics, and external latent embeddings from a proprietary driving model. It claims to produce high-resolution, spatiotemporally consistent videos across UK, US, and German environments to enable scalable simulation of common and rare scenarios.

Significance. If the controllability and consistency claims hold with supporting evidence, GAIA-2 could meaningfully advance generative world models as simulation tools for AV development by allowing flexible conditioning on structured inputs, potentially improving data diversity for perception and planning training.

major comments (2)

[Experiments] Experiments section: The manuscript presents only qualitative video results and a project page link, with no reported quantitative metrics (e.g., FVD, FID, cross-view alignment scores, or temporal coherence measures) or ablation studies on the structured conditioning versus latent embeddings. This directly undermines the load-bearing claim that the outputs are sufficiently realistic and consistent for downstream AV training without introducing biases.
[§3] §3 (Model Architecture): The integration of structured inputs (ego dynamics, agents, semantics) with external latent embeddings is described at a high level, but no equations or diagrams detail how these are fused in the diffusion process or how multi-camera consistency is enforced across views, leaving the spatiotemporal consistency mechanism unverified.
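
For readers unfamiliar with the metrics requested in the first major comment: FID and FVD are both Fréchet distances between Gaussian fits to feature vectors extracted from real and generated samples by a pretrained network, d² = ‖μ_a − μ_b‖² + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^{1/2}). The sketch below is a deliberately simplified version that assumes diagonal covariances, in which case the trace term reduces to a sum of squared differences of per-dimension standard deviations. Real FID and FVD use full covariance matrices and a matrix square root, so this function (a name made up here) is for intuition only and its values are not comparable to published scores.

```python
import math
from statistics import fmean, pvariance
from typing import Sequence


def frechet_distance_diag(
    feats_a: Sequence[Sequence[float]],
    feats_b: Sequence[Sequence[float]],
) -> float:
    """Squared Fréchet distance between two feature sets under a diagonal-covariance assumption.

    Each argument is a list of feature vectors (one per sample) of equal length.
    Population variance is used; standard FID code uses full sample covariances instead.
    """
    dims = len(feats_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [row[d] for row in feats_a]
        col_b = [row[d] for row in feats_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        sd_a, sd_b = math.sqrt(pvariance(col_a)), math.sqrt(pvariance(col_b))
        # Mean term plus the diagonal form of Tr(S_a + S_b - 2 sqrt(S_a S_b)).
        total += (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2
    return total


# Toy features: set B is set A shifted by 3 along the first dimension.
toy_a = [[0.0, 0.0], [2.0, 2.0]]
toy_b = [[3.0, 0.0], [5.0, 2.0]]
shifted_distance = frechet_distance_diag(toy_a, toy_b)
```

A distance of zero means the two fitted Gaussians coincide; it does not by itself establish cross-view or temporal consistency, which is why the referee asks for those measures separately.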

minor comments (2)

[Abstract] Abstract: The claim of 'geographically diverse driving environments (UK, US, Germany)' would benefit from a brief statement on dataset scale or scene variety to contextualize the qualitative examples.
[§3] Notation: The distinction between 'structured conditioning' and 'external latent embeddings' is introduced without a clear table or diagram summarizing input types and their dimensionalities.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on GAIA-2. We agree that strengthening the quantitative evaluation and architectural details will improve the manuscript and will revise accordingly.

Referee: [Experiments] Experiments section: The manuscript presents only qualitative video results and a project page link, with no reported quantitative metrics (e.g., FVD, FID, cross-view alignment scores, or temporal coherence measures) or ablation studies on the structured conditioning versus latent embeddings. This directly undermines the load-bearing claim that the outputs are sufficiently realistic and consistent for downstream AV training without introducing biases.

Authors: We acknowledge that the current manuscript relies primarily on qualitative demonstrations. In the revision we will add quantitative metrics including FVD, FID, cross-view alignment scores, and temporal coherence measures. We will also include ablation studies isolating the contributions of structured conditioning versus external latent embeddings. These additions will provide direct empirical support for the realism and consistency claims relevant to AV training.
revision: yes

Referee: [§3] §3 (Model Architecture): The integration of structured inputs (ego dynamics, agents, semantics) with external latent embeddings is described at a high level, but no equations or diagrams detail how these are fused in the diffusion process or how multi-camera consistency is enforced across views, leaving the spatiotemporal consistency mechanism unverified.

Authors: We agree that the current description in §3 is high-level. In the revised manuscript we will add explicit equations for the fusion of structured inputs and latent embeddings inside the diffusion U-Net, together with diagrams showing the cross-view attention and temporal modeling components used to enforce spatiotemporal consistency across cameras.
revision: yes
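
Since neither the abstract nor this page states how the two conditioning sources are fused, the following is only a sketch of one common pattern from latent diffusion practice: concatenate the structured-input tokens and the external-embedding tokens into a single context sequence and let the video latent tokens cross-attend to it. Everything here is an assumption for illustration: the function names are made up, there are no learned projections, and nothing is claimed about GAIA-2's actual mechanism.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention where the context serves as both keys and values."""
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) * scale for k in context])
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return attended


def fuse_conditioning(
    video_tokens: List[Vector],
    structured_tokens: List[Vector],
    external_tokens: List[Vector],
) -> List[Vector]:
    """Hypothetical fusion: one shared context, cross-attention, then a residual add."""
    context = structured_tokens + external_tokens  # both sources share one sequence
    attended = cross_attention(video_tokens, context)
    return [[x + a for x, a in zip(tok, att)] for tok, att in zip(video_tokens, attended)]


# Two video tokens attending over one structured token and one external token.
fused = fuse_conditioning(
    video_tokens=[[1.0, 0.0], [0.0, 1.0]],
    structured_tokens=[[1.0, 0.0]],
    external_tokens=[[0.0, 1.0]],
)
```

A design like this makes the ablation the referee asks for straightforward in principle: dropping either token group from the context isolates the contribution of the other.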

Circularity Check

0 steps flagged

GAIA-2 presented as new model construction with no circular derivation chain

The paper introduces GAIA-2 as a latent diffusion world model supporting controllable multi-view video generation via structured conditioning inputs (ego dynamics, agents, environment, road semantics) plus optional latent embeddings. No equations, predictions, or first-principles results are claimed that reduce by construction to fitted parameters or prior self-citations within the same framework. The architecture is described as a new unification of capabilities rather than a derivation from existing fitted quantities, and no uniqueness theorems or ansatzes are imported via self-citation. The central claims rest on the model's design and qualitative outputs, which are independent of any internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard assumptions of latent diffusion models being able to produce consistent multi-view outputs when given rich conditioning; no explicit free parameters or invented physical entities are detailed in the abstract.

axioms (1)

Domain assumption: Latent diffusion models can be conditioned on structured driving inputs to produce spatiotemporally consistent multi-camera video.
Core premise of the GAIA-2 framework, stated in the abstract.

invented entities (1)

GAIA-2 (no independent evidence)
Purpose: Unified generative world model for controllable multi-view driving video synthesis.
New model name and architecture introduced by the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
cs.CV 2026-05 conditional novelty 7.0
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

Is Your Driving World Model an All-Around Player?
cs.CV 2026-05 unverdicted novelty 7.0
WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.

Latent State Design for World Models under Sufficiency Constraints
cs.AI 2026-05 unverdicted novelty 7.0
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
cs.RO 2026-04 unverdicted novelty 7.0
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

MultiWorld: Scalable Multi-Agent Multi-View Video World Models
cs.CV 2026-04 unverdicted novelty 7.0
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation
cs.CV 2026-04 unverdicted novelty 7.0
ScenarioControl introduces the first vision-language controllable generator for realistic vectorized 3D driving scenarios with temporal consistency across actor views.

Training Agents Inside of Scalable World Models
cs.AI 2025-09 conditional novelty 7.0
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes
cs.CV 2026-05 unverdicted novelty 6.0
Real2Sim reconstructs editable dynamic driving scenes as temporally continuous Gaussians integrated with a differentiable MPM physics solver for high-fidelity simulation of interactions and collisions.

Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
cs.CV 2026-05 unverdicted novelty 6.0
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

LA-Pose: Latent Action Pretraining Meets Pose Estimation
cs.CV 2026-04 unverdicted novelty 6.0
LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...

Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
cs.CV 2026-04 unverdicted novelty 6.0
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
cs.CV 2026-04 unverdicted novelty 6.0
LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
eess.IV 2026-03 unverdicted novelty 6.0
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
cs.AI 2025-06 unverdicted novelty 6.0
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
cs.CV 2026-04 unverdicted novelty 5.0
Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...

Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic
cs.AI 2026-04 unverdicted novelty 5.0
This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.

Ozone: A Unified Platform for Transportation Research
cs.DB 2026-04 unverdicted novelty 5.0
Ozone unifies four trajectory datasets into a canonical format with standardized schemas and provides CARLA-based benchmarking, claiming 85% faster experiment setup and 91% cross-city transfer efficiency.

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
cs.CV 2026-04 unverdicted novelty 4.0
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
