Pith · machine review for the scientific record

arxiv: 2408.14837 · v2 · submitted 2024-08-27 · 💻 cs.LG · cs.AI · cs.CV

Recognition: 3 Lean theorem links

Diffusion Models Are Real-Time Game Engines

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:01 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords diffusion models · game engines · real-time simulation · DOOM · next-frame prediction · autoregressive generation · neural rendering

The pith

A diffusion model trained on gameplay can serve as a complete real-time game engine for complex environments like DOOM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a diffusion model can generate the next frame of a game from sequences of prior frames and player actions, creating an interactive simulation that runs at 20 frames per second. The system is trained in two phases: an RL agent plays DOOM while its sessions are recorded, and the diffusion model is then fit to predict subsequent frames from those recordings. The result maintains visual coherence and stability across multi-minute sessions. Next-frame quality reaches a PSNR of 29.4, comparable to lossy JPEG compression, and human raters have difficulty telling short clips apart from actual gameplay. This approach replaces traditional rule-based engines with learned prediction while preserving real-time responsiveness and long-horizon consistency through conditioning augmentations and decoder fine-tuning.

Core claim

GameNGen is the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. When trained on DOOM, it extracts gameplay to generate a playable environment that can interactively simulate new trajectories. The model runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, and human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation.
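For calibration, PSNR is a plain pixel-fidelity score. The snippet below is ours, not the paper's: it shows the standard definition and what 29.4 dB implies for 8-bit frames.

```python
import numpy as np

def psnr(reference, prediction, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer frames."""
    diff = reference.astype(np.float64) - prediction.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

# 29.4 dB over 8-bit frames corresponds to an RMS error of
# 255 / 10**(29.4 / 20) ≈ 8.6 gray levels per pixel, i.e. roughly
# mid-quality JPEG, which matches the paper's own comparison.
```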

What carries the argument

Diffusion model for next-frame prediction conditioned on sequences of past frames and actions, with conditioning augmentations and decoder fine-tuning to support stable long-horizon auto-regressive rollouts.
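As a concrete illustration, here is a minimal sketch of what one such training step could look like. Everything below, from the toy denoiser to the tensor shapes and the ctx_noise_max knob, is our own illustrative assumption rather than the paper's architecture; the point is only the shape of the conditioning: a noisy target frame denoised against past frames and actions, with the context frames themselves randomly corrupted (the conditioning augmentation) so the model tolerates its own imperfect outputs at rollout time.

```python
import torch
import torch.nn as nn

CTX, H, W, N_ACTIONS = 4, 32, 32, 8   # toy context length and frame size

class ToyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.act_emb = nn.Embedding(N_ACTIONS, 16)
        # Noisy target frame and CTX context frames stacked on the channel axis.
        self.net = nn.Conv2d(3 * (CTX + 1), 3, kernel_size=3, padding=1)
        # Actions and noise level modulate the input (a crude stand-in for
        # the embedding / cross-attention conditioning a real model would use).
        self.cond = nn.Linear(16 * CTX + 1, 3 * (CTX + 1))

    def forward(self, noisy_frame, ctx_frames, actions, sigma):
        x = torch.cat([noisy_frame, ctx_frames.flatten(1, 2)], dim=1)
        c = torch.cat([self.act_emb(actions).flatten(1), sigma[:, None]], dim=1)
        scale = self.cond(c)[:, :, None, None]
        return self.net(x * (1 + scale))  # predict the clean next frame

def training_step(model, frames, actions, ctx_noise_max=0.7):
    """frames: (B, CTX+1, 3, H, W) clean clips; actions: (B, CTX) int64."""
    ctx, target = frames[:, :CTX], frames[:, CTX]
    # Conditioning augmentation: randomly corrupt the *context* frames so the
    # model learns to tolerate its own imperfect outputs during rollout.
    level = torch.rand(ctx.shape[0], 1, 1, 1, 1) * ctx_noise_max
    ctx = ctx + level * torch.randn_like(ctx)
    sigma = torch.rand(frames.shape[0])                      # diffusion noise level
    noisy = target + sigma[:, None, None, None] * torch.randn_like(target)
    pred = model(noisy, ctx, actions, sigma)
    return ((pred - target) ** 2).mean()

model = ToyDenoiser()
frames = torch.randn(2, CTX + 1, 3, H, W)
actions = torch.randint(0, N_ACTIONS, (2, CTX))
training_step(model, frames, actions).backward()
```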

If this is right

  • Neural models can replace traditional rule-based engines for interactive simulation of complex environments.
  • Real-time frame generation at 20 FPS is achievable on single-accelerator hardware using diffusion techniques (a back-of-envelope latency budget follows this list).
  • Long-term coherence in auto-regressive video prediction becomes feasible with targeted conditioning methods.
  • Visual quality at the level of lossy compression is sufficient to support convincing interactive gameplay.
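The 20 FPS point implies a hard latency budget. The arithmetic below is our back-of-envelope sketch; the four-step sampler count is an assumption, as the abstract does not state one.

```python
# Our back-of-envelope arithmetic, not figures from the paper.
fps = 20
budget_ms = 1000 / fps                    # 50 ms of wall time per generated frame
denoise_steps = 4                         # assumed few-step distilled sampler
per_step_ms = budget_ms / denoise_steps   # each denoiser pass must fit in 12.5 ms
print(f"{budget_ms:.1f} ms/frame, {per_step_ms:.1f} ms/denoising step")
```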

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other 2D games or simplified 3D environments if large volumes of recorded play data are available.
  • Game development costs might decrease by learning environments directly from play sessions instead of manual rule and asset creation.
  • Fully AI-driven loops could emerge by combining these engines with reinforcement learning agents that train inside the learned simulation.

Load-bearing premise

Conditioning augmentations and decoder fine-tuning will continue to prevent error accumulation and visual drift during extended auto-regressive rollouts beyond the tested multi-minute sessions.

What would settle it

Observing accumulating visual drift, action mismatches, or artifacts in frames generated during continuous play sessions longer than five minutes would falsify the stability claim.
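A minimal sketch of that falsification test, under assumed interfaces: engine.step(history, action) stands in for the learned simulator and game.step(action) for the ground-truth engine; neither is an API from the paper. It reuses the psnr helper sketched earlier.

```python
from collections import deque

def drift_curve(engine, game, actions, ctx_len=4):
    """Per-frame PSNR between simulated and real frames along one action sequence."""
    seed = game.reset()                            # assumed to return ctx_len real frames
    history = deque(seed, maxlen=ctx_len)
    curve = []
    for action in actions:
        sim_frame = engine.step(list(history), action)
        real_frame = game.step(action)
        curve.append(psnr(real_frame, sim_frame))  # psnr() as defined above
        history.append(sim_frame)                  # feed back the model's own output
    return curve  # a sustained downward slope past five minutes would falsify stability
```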

Original abstract

We present GameNGen, the first game engine powered entirely by a neural model that also enables real-time interaction with a complex environment over long trajectories at high quality. When trained on the classic game DOOM, GameNGen extracts gameplay and uses it to generate a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations help ensure stable auto-regressive generation over long trajectories, and decoder fine-tuning improves the fidelity of visual details and text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents GameNGen, a diffusion model trained in two phases (RL agent data collection followed by next-frame prediction) to serve as a real-time neural game engine for DOOM. Conditioned on action sequences and past frames with augmentations plus decoder fine-tuning, it achieves 20 FPS on a single TPU, PSNR of 29.4, and human discrimination near chance level even after 5 minutes of autoregressive rollout, claiming the first such system for complex interactive environments over long trajectories.

Significance. If the long-horizon stability holds, the result would be a meaningful demonstration that diffusion models can approximate interactive game dynamics at interactive rates without explicit physics or state machines. The reported real-time performance and near-indistinguishability metrics would strengthen the case for generative models in simulation, though the work's reliance on external RL trajectories limits claims of full autonomy.

major comments (3)
  1. [Abstract and evaluation section] The stability claim over 'extended multi-minute play sessions' rests on 5-minute rollouts without reported quantitative drift metrics (e.g., per-frame PSNR decay curves, failure rates, or error bars) or tests under player inputs outside the RL-collected distribution; this leaves the generalization of conditioning augmentations unverified for the central long-trajectory claim.
  2. [Training procedure (phase 2)] No ablations isolate the contribution of conditioning augmentations or decoder fine-tuning to stability, nor is the exact volume of RL-generated training data or number of unique trajectories specified; without these, it is unclear whether the reported PSNR 29.4 and human results are robust or tied to the specific data regime.
  3. [Architecture description] The model uses only frame history plus augmentations without explicit memory or state representations; this makes the absence of compounding visual/mechanical drift over arbitrary lengths an empirical observation rather than a bounded property, requiring additional verification beyond the tested horizon.
minor comments (3)
  1. [Method] Clarify the precise form of action conditioning and frame history length used in the diffusion process.
  2. [Results] Add error bars or confidence intervals to the PSNR and human study results.
  3. [Discussion] Discuss potential failure modes for player behaviors not represented in the RL training data.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and clarify limitations.

Point-by-point responses
  1. Referee: [Abstract and evaluation section] The stability claim over 'extended multi-minute play sessions' rests on 5-minute rollouts without reported quantitative drift metrics (e.g., per-frame PSNR decay curves, failure rates, or error bars) or tests under player inputs outside the RL-collected distribution; this leaves the generalization of conditioning augmentations unverified for the central long-trajectory claim.

    Authors: We agree that quantitative drift metrics would provide stronger support for the long-horizon claim. In the revised manuscript we will add averaged per-frame PSNR curves over the 5-minute rollouts (with error bars across multiple seeds), explicit failure rates for divergence, and a clarification that the human study involves live interactive inputs which can deviate from the RL distribution. We will also note the role of conditioning augmentations in mitigating observed drift. revision: yes

  2. Referee: [Training procedure (phase 2)] No ablations isolate the contribution of conditioning augmentations or decoder fine-tuning to stability, nor is the exact volume of RL-generated training data or number of unique trajectories specified; without these, it is unclear whether the reported PSNR 29.4 and human results are robust or tied to the specific data regime.

    Authors: We acknowledge that ablations would help isolate contributions. Due to compute limits we did not run exhaustive ablations for the initial submission. We will revise to report the exact data volume (approximately 80 hours of gameplay across 400 unique trajectories) and add a qualitative discussion of the observed effects of augmentations and decoder fine-tuning based on development experiments. Full quantitative ablations remain future work. revision: partial

  3. Referee: [Architecture description] The model uses only frame history plus augmentations without explicit memory or state representations; this makes the absence of compounding visual/mechanical drift over arbitrary lengths an empirical observation rather than a bounded property, requiring additional verification beyond the tested horizon.

    Authors: This observation is accurate: stability is demonstrated empirically up to the 5-minute horizon rather than as a theoretically bounded property. We will update the architecture and limitations sections to explicitly characterize the result as empirical, discuss the reliance on recent-frame conditioning plus augmentations, and note that longer horizons may require additional mechanisms such as explicit memory. revision: yes

Circularity Check

0 steps flagged

No circularity: next-frame diffusion trained independently on external RL recordings

Full rationale

The paper describes a two-stage pipeline in which an RL agent first generates gameplay trajectories that are recorded as external data, after which a diffusion model is trained on the separate task of next-frame prediction conditioned on past frames and actions. Conditioning augmentations and decoder fine-tuning are training heuristics whose effect on long-horizon stability is assessed empirically via auto-regressive rollouts; these steps do not reduce to the input data or to any fitted parameter by construction. No equations, uniqueness theorems, or self-citations are invoked that would make the reported PSNR, FPS, or human-rater results tautological with the training procedure.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical success of conditional diffusion for next-frame prediction plus the unstated assumption that the collected DOOM trajectories sufficiently cover the state space needed for stable long-horizon generation. No new physical entities or mathematical axioms are introduced.

free parameters (2)
  • conditioning augmentation strength
    Hyperparameters controlling how past frames and actions are augmented during training to promote stability; their specific values are chosen to achieve the reported long-trajectory performance.
  • decoder fine-tuning learning rate and steps
    Parameters used in the second training phase to improve visual fidelity; these are fitted to the target game data.
axioms (1)
  • domain assumption: The distribution of next frames given recent history and actions is learnable from finite gameplay recordings.
    Invoked implicitly when the diffusion model is trained to approximate the conditional distribution for auto-regressive rollout; a symbolic rendering follows below.
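In symbols (our notation, not the paper's), the axiom amounts to assuming a fixed conditional that the rollout then chains:

```latex
% Our notation: x_t are frames, a_t actions, k the context length.
% The engine approximates one conditional and rolls it out autoregressively:
p_\theta\!\left(x_t \mid x_{t-k:t-1},\, a_{t-k:t-1}\right),
\qquad
p_\theta\!\left(x_{1:T} \mid x_{1-k:0},\, a_{1-k:T-1}\right)
  \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{t-k:t-1},\, a_{t-k:t-1}\right).
```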

pith-pipeline@v0.9.0 · 5500 in / 1339 out tokens · 37162 ms · 2026-05-16T12:01:01.639050+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WorldMark: A Unified Benchmark Suite for Interactive Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.

  2. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  3. Telecom World Models: Unifying Digital Twins, Foundation Models, and Predictive Planning for 6G

    cs.RO 2026-04 unverdicted novelty 7.0

    Telecom World Models introduce a three-layer architecture for learned, action-conditioned, uncertainty-aware modeling of 6G network dynamics, combining digital twins and foundation models, with a network slicing proof...

  4. SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.

  5. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  6. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  7. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  8. Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows

    cs.LG 2026-03 unverdicted novelty 6.0

    WinDiNet repurposes a 2B-parameter video diffusion model as a differentiable surrogate that generates 112-frame urban wind flow rollouts in under one second and enables direct gradient optimization of building positions.

  9. DynaWeb: Model-Based Reinforcement Learning of Web Agents

    cs.CL 2026-01 unverdicted novelty 6.0

    DynaWeb introduces a model-based RL framework that trains web agents via imagined rollouts in a learned web world model interleaved with real expert trajectories, yielding consistent gains on WebArena and WebVoyager b...

  10. Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    cs.CV 2025-09 unverdicted novelty 6.0

    Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.

  11. Test-Time Training Done Right

    cs.LG 2025-05 conditional novelty 6.0

    Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.

  12. Unified Video Action Model

    cs.RO 2025-02 unverdicted novelty 6.0

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

  13. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  14. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  15. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  16. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  17. Advancing Open-source World Models

    cs.CV 2026-01 unverdicted novelty 4.0

    LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.

  18. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 3.0

    This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.

  19. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 18 Pith papers · 15 internal anchors
