pith. machine review for the scientific record.

arxiv: 2503.20523 · v1 · submitted 2025-03-26 · 💻 cs.CV · cs.AI · cs.RO

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Pith reviewed 2026-05-15 13:44 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords generative world model · autonomous driving · video generation · latent diffusion · multi-camera consistency · controllable simulation · synthetic data · spatiotemporal consistency

The pith

GAIA-2 generates high-resolution multi-camera driving videos from structured inputs like vehicle dynamics, agent positions, and road semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GAIA-2 as a latent diffusion world model built to meet autonomous driving needs that current generative approaches miss. It produces spatiotemporally consistent videos across multiple cameras by conditioning on ego-vehicle dynamics, agent configurations, environmental factors, and road semantics, while also accepting external latent embeddings. The work targets scalable creation of both everyday and rare driving scenes in varied locations such as the UK, US, and Germany. If the outputs prove usable, they could supply synthetic training data that reduces dependence on costly real-world collection. A sympathetic reader would focus on whether this unified conditioning framework actually delivers the control and consistency required for downstream autonomous systems.

Core claim

GAIA-2 is a latent diffusion world model that unifies controllable multi-view video generation within one framework. It conditions generation on structured inputs consisting of ego-vehicle dynamics, agent configurations, environmental factors, and road semantics, and further integrates external latent embeddings to support semantically grounded scene synthesis. The model produces high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments.

What carries the argument

Latent diffusion world model that combines structured conditioning signals with external latent embeddings for flexible, multi-view scene synthesis.
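
To make the conditioning interface concrete, the sketch below groups the four structured inputs and the optional external embedding named in the abstract into one container. Every field name, type, and example value is a hypothetical illustration, not the paper's schema; how GAIA-2 actually encodes these signals is not described on this page.

```python
from dataclasses import dataclass, fields
from typing import Optional, Sequence


@dataclass
class Conditioning:
    """Hypothetical bundle of conditioning signals (names are assumptions)."""
    ego_dynamics: Optional[Sequence[float]] = None            # e.g. speed, curvature
    agent_boxes: Optional[Sequence[Sequence[float]]] = None   # one box per agent
    environment: Optional[dict] = None                        # e.g. weather, country
    road_semantics: Optional[dict] = None                     # e.g. lane count
    external_latent: Optional[Sequence[float]] = None         # embedding from another model


def active_signals(cond: Conditioning) -> list:
    """Return the names of the signals that are actually provided."""
    return [f.name for f in fields(cond) if getattr(cond, f.name) is not None]


cond = Conditioning(ego_dynamics=[10.0, 0.0], environment={"weather": "rain"})
print(active_signals(cond))
```

Leaving a field as `None` stands in for generating without that signal, which is one simple way to picture mixing structured inputs with an external embedding.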

If this is right

  • Enables scalable simulation of both common and rare driving scenarios without additional real-world data collection.
  • Supports multi-agent interactions and multi-camera consistency in a single generative pass.
  • Allows flexible conditioning that mixes structured inputs with external model embeddings for semantically controlled outputs.
  • Provides synthetic data usable across geographically diverse environments including the UK, US, and Germany.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Could lower the barrier to testing dangerous or low-probability edge cases by generating them on demand rather than waiting for real occurrences.
  • Opens the possibility of closed-loop simulation where generated scenes feed directly into planning models that then influence the next generation step.
  • Might accelerate iteration on autonomous systems by allowing rapid creation of targeted datasets focused on specific failure modes.

Load-bearing premise

The generated videos must be realistic, consistent, and free of artifacts so that models trained on them transfer effectively to real vehicles without new biases or failures.

What would settle it

Train an autonomous driving perception or planning model exclusively on GAIA-2 videos and measure its performance on held-out real-world driving data; performance that matches or exceeds a real-data baseline would support the claim.

Original abstract

Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving - such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at https://wayve.ai/thinking/gaia-2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GAIA-2, a latent diffusion world model for autonomous driving that unifies controllable multi-view video generation conditioned on ego-vehicle dynamics, agent configurations, environmental factors, road semantics, and external latent embeddings from a proprietary driving model. It claims to produce high-resolution, spatiotemporally consistent videos across UK, US, and German environments to enable scalable simulation of common and rare scenarios.

Significance. If the controllability and consistency claims hold with supporting evidence, GAIA-2 could meaningfully advance generative world models as simulation tools for AV development by allowing flexible conditioning on structured inputs, potentially improving data diversity for perception and planning training.

major comments (2)
  1. [Experiments] Experiments section: The manuscript presents only qualitative video results and a project page link, with no reported quantitative metrics (e.g., FVD, FID, cross-view alignment scores, or temporal coherence measures) or ablation studies on the structured conditioning versus latent embeddings. This directly undermines the load-bearing claim that the outputs are sufficiently realistic and consistent for downstream AV training without introducing biases.
  2. [§3] §3 (Model Architecture): The integration of structured inputs (ego dynamics, agents, semantics) with external latent embeddings is described at a high level, but no equations or diagrams detail how these are fused in the diffusion process or how multi-camera consistency is enforced across views, leaving the spatiotemporal consistency mechanism unverified.
minor comments (2)
  1. [Abstract] Abstract: The claim of 'geographically diverse driving environments (UK, US, Germany)' would benefit from a brief statement on dataset scale or scene variety to contextualize the qualitative examples.
  2. [§3] Notation: The distinction between 'structured conditioning' and 'external latent embeddings' is introduced without a clear table or diagram summarizing input types and their dimensionalities.
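
For context on the metrics named in major comment 1: FID and FVD are both Fréchet distances between Gaussians fitted to feature embeddings of real and generated samples, d² = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). The sketch below computes that quantity with NumPy, assuming the features have already been extracted by some embedding network (an Inception network for FID, a video network for FVD). The function name and the random stand-in features are illustrative and not taken from the paper.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two (N, D) feature arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative in theory;
    # clip tiny negative values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Stand-in features: random vectors in place of real network embeddings.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
print(frechet_distance(real, real))        # close to zero for identical sets
print(frechet_distance(real, real + 3.0))  # mean shift only
```

Identical feature sets give a distance near zero, and shifting every feature by a constant c in D dimensions adds D·c² through the mean term alone, which makes a quick sanity check before running the metric on real embeddings.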

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on GAIA-2. We agree that strengthening the quantitative evaluation and architectural details will improve the manuscript and will revise accordingly.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The manuscript presents only qualitative video results and a project page link, with no reported quantitative metrics (e.g., FVD, FID, cross-view alignment scores, or temporal coherence measures) or ablation studies on the structured conditioning versus latent embeddings. This directly undermines the load-bearing claim that the outputs are sufficiently realistic and consistent for downstream AV training without introducing biases.

    Authors: We acknowledge that the current manuscript relies primarily on qualitative demonstrations. In the revision we will add quantitative metrics including FVD, FID, cross-view alignment scores, and temporal coherence measures. We will also include ablation studies isolating the contributions of structured conditioning versus external latent embeddings. These additions will provide direct empirical support for the realism and consistency claims relevant to AV training.

    Revision: yes

  2. Referee: [§3] §3 (Model Architecture): The integration of structured inputs (ego dynamics, agents, semantics) with external latent embeddings is described at a high level, but no equations or diagrams detail how these are fused in the diffusion process or how multi-camera consistency is enforced across views, leaving the spatiotemporal consistency mechanism unverified.

    Authors: We agree that the current description in §3 is high-level. In the revised manuscript we will add explicit equations for the fusion of structured inputs and latent embeddings inside the denoising network, together with diagrams showing the cross-view attention and temporal modeling components used to enforce spatiotemporal consistency across cameras.

    Revision: yes

Circularity Check

0 steps flagged

GAIA-2 presented as new model construction with no circular derivation chain

Full rationale

The paper introduces GAIA-2 as a latent diffusion world model supporting controllable multi-view video generation via structured conditioning inputs (ego dynamics, agents, environment, road semantics) plus optional latent embeddings. No equations, predictions, or first-principles results are claimed that reduce by construction to fitted parameters or prior self-citations within the same framework. The architecture is described as a new unification of capabilities rather than a derivation from existing fitted quantities, and no uniqueness theorems or ansatzes are imported via self-citation. The central claims rest on the model's design and qualitative outputs, which are independent of any internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard assumptions of latent diffusion models being able to produce consistent multi-view outputs when given rich conditioning; no explicit free parameters or invented physical entities are detailed in the abstract.

axioms (1)
  • domain assumption: Latent diffusion models can be conditioned on structured driving inputs to produce spatiotemporally consistent multi-camera video
    Core premise of the GAIA-2 framework stated in the abstract
invented entities (1)
  • GAIA-2 (no independent evidence)
    purpose: Unified generative world model for controllable multi-view driving video synthesis
    New model name and architecture introduced by the paper

pith-pipeline@v0.9.0 · 5511 in / 1222 out tokens · 35100 ms · 2026-05-15T13:44:17.546267+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  2. Is Your Driving World Model an All-Around Player?

    cs.CV 2026-05 unverdicted novelty 7.0

    WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.

  3. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  4. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  5. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  6. ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    ScenarioControl introduces the first vision-language controllable generator for realistic vectorized 3D driving scenarios with temporal consistency across actor views.

  7. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  8. Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes

    cs.CV 2026-05 unverdicted novelty 6.0

    Real2Sim reconstructs editable dynamic driving scenes as temporally continuous Gaussians integrated with a differentiable MPM physics solver for high-fidelity simulation of interactions and collisions.

  9. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  10. LA-Pose: Latent Action Pretraining Meets Pose Estimation

    cs.CV 2026-04 unverdicted novelty 6.0

    LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...

  11. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  12. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  13. LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.

  14. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  15. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  16. Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...

  17. Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic

    cs.AI 2026-04 unverdicted novelty 5.0

    This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.

  18. Ozone: A Unified Platform for Transportation Research

    cs.DB 2026-04 unverdicted novelty 5.0

    Ozone unifies four trajectory datasets into a canonical format with standardized schemas and provides CARLA-based benchmarking, claiming 85% faster experiment setup and 91% cross-city transfer efficiency.

  19. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
