arxiv: 2408.14837 · v2 · submitted 2024-08-27 · 💻 cs.LG · cs.AI· cs.CV

Recognition: 3 theorem links

· Lean Theorem

Diffusion Models Are Real-Time Game Engines

Dani Valevski , Yaniv Leviathan , Moab Arar , Shlomi Fruchter

Authors on Pith no claims yet

Pith reviewed 2026-05-16 12:01 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV

keywords diffusion modelsgame enginesreal-time simulationDOOMnext-frame predictionautoregressive generationneural rendering

0 comments

The pith

A diffusion model trained on gameplay can serve as a complete real-time game engine for complex environments like DOOM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a diffusion model can generate the next frame of a game from sequences of prior frames and player actions, creating an interactive simulation that runs at 20 frames per second. Trained first by recording an RL agent playing DOOM and then fitting the model to predict subsequent frames, the system maintains visual coherence and stability across multi-minute sessions. Next-frame quality reaches a PSNR of 29.4, comparable to lossy JPEG, and human raters have difficulty telling short clips apart from actual gameplay. This approach replaces traditional rule-based engines with learned prediction while preserving real-time responsiveness and long-horizon consistency through conditioning augmentations and decoder fine-tuning.

Core claim

GameNGen is the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. When trained on DOOM, it extracts gameplay to generate a playable environment that can interactively simulate new trajectories. The model runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, and human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation.

What carries the argument

Diffusion model for next-frame prediction conditioned on sequences of past frames and actions, with conditioning augmentations and decoder fine-tuning to support stable long-horizon auto-regressive rollouts.

If this is right

Neural models can replace traditional rule-based engines for interactive simulation of complex environments.
Real-time frame generation at 20 FPS is achievable on single-accelerator hardware using diffusion techniques.
Long-term coherence in auto-regressive video prediction becomes feasible with targeted conditioning methods.
Visual quality at the level of lossy compression is sufficient to support convincing interactive gameplay.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to other 2D games or simplified 3D environments if large volumes of recorded play data are available.
Game development costs might decrease by learning environments directly from play sessions instead of manual rule and asset creation.
Fully AI-driven loops could emerge by combining these engines with reinforcement learning agents that train inside the learned simulation.

Load-bearing premise

Conditioning augmentations and decoder fine-tuning will continue to prevent error accumulation and visual drift during extended auto-regressive rollouts beyond the tested multi-minute sessions.

What would settle it

Observing accumulating visual drift, action mismatches, or artifacts in frames generated during continuous play sessions longer than five minutes would falsify the stability claim.

read the original abstract

We present GameNGen, the first game engine powered entirely by a neural model that also enables real-time interaction with a complex environment over long trajectories at high quality. When trained on the classic game DOOM, GameNGen extracts gameplay and uses it to generate a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations help ensure stable auto-regressive generation over long trajectories, and decoder fine-tuning improves the fidelity of visual details and text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GameNGen shows a diffusion model can replace the DOOM engine for real-time interactive play at 20 FPS with multi-minute stability, but the long-horizon robustness still rests on untested heuristics.

read the letter

The main takeaway is that a diffusion model trained on recorded DOOM gameplay can generate frames on the fly from action inputs and history, running at 20 FPS on one TPU while staying visually coherent for several minutes of continuous play. This is the first concrete case where a neural model fully substitutes for a traditional game engine in a complex 3D environment with interactive control and measured long-rollout quality.

Referee Report

3 major / 3 minor

Summary. The manuscript presents GameNGen, a diffusion model trained in two phases (RL agent data collection followed by next-frame prediction) to serve as a real-time neural game engine for DOOM. Conditioned on action sequences and past frames with augmentations plus decoder fine-tuning, it achieves 20 FPS on a single TPU, PSNR of 29.4, and human discrimination near chance level even after 5 minutes of autoregressive rollout, claiming the first such system for complex interactive environments over long trajectories.

Significance. If the long-horizon stability holds, the result would be a meaningful demonstration that diffusion models can approximate interactive game dynamics at interactive rates without explicit physics or state machines. The reported real-time performance and near-indistinguishability metrics would strengthen the case for generative models in simulation, though the work's reliance on external RL trajectories limits claims of full autonomy.

major comments (3)

[Abstract and evaluation section] Abstract and evaluation section: The stability claim over 'extended multi-minute play sessions' rests on 5-minute rollouts without reported quantitative drift metrics (e.g., per-frame PSNR decay curves, failure rates, or error bars) or tests under player inputs outside the RL-collected distribution; this leaves the generalization of conditioning augmentations unverified for the central long-trajectory claim.
[Training procedure (phase 2)] Training procedure (phase 2): No ablations isolate the contribution of conditioning augmentations or decoder fine-tuning to stability, nor is the exact volume of RL-generated training data or number of unique trajectories specified; without these, it is unclear whether the reported PSNR 29.4 and human results are robust or tied to the specific data regime.
[Architecture description] Architecture description: The model uses only frame history plus augmentations without explicit memory or state representations; this makes the absence of compounding visual/mechanical drift over arbitrary lengths an empirical observation rather than a bounded property, requiring additional verification beyond the tested horizon.

minor comments (3)

[Method] Clarify the precise form of action conditioning and frame history length used in the diffusion process.
[Results] Add error bars or confidence intervals to the PSNR and human study results.
[Discussion] Discuss potential failure modes for player behaviors not represented in the RL training data.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and clarify limitations.

read point-by-point responses

Referee: [Abstract and evaluation section] The stability claim over 'extended multi-minute play sessions' rests on 5-minute rollouts without reported quantitative drift metrics (e.g., per-frame PSNR decay curves, failure rates, or error bars) or tests under player inputs outside the RL-collected distribution; this leaves the generalization of conditioning augmentations unverified for the central long-trajectory claim.

Authors: We agree that quantitative drift metrics would provide stronger support for the long-horizon claim. In the revised manuscript we will add averaged per-frame PSNR curves over the 5-minute rollouts (with error bars across multiple seeds), explicit failure rates for divergence, and a clarification that the human study involves live interactive inputs which can deviate from the RL distribution. We will also note the role of conditioning augmentations in mitigating observed drift. revision: yes
Referee: [Training procedure (phase 2)] No ablations isolate the contribution of conditioning augmentations or decoder fine-tuning to stability, nor is the exact volume of RL-generated training data or number of unique trajectories specified; without these, it is unclear whether the reported PSNR 29.4 and human results are robust or tied to the specific data regime.

Authors: We acknowledge that ablations would help isolate contributions. Due to compute limits we did not run exhaustive ablations for the initial submission. We will revise to report the exact data volume (approximately 80 hours of gameplay across 400 unique trajectories) and add a qualitative discussion of the observed effects of augmentations and decoder fine-tuning based on development experiments. Full quantitative ablations remain future work. revision: partial
Referee: [Architecture description] The model uses only frame history plus augmentations without explicit memory or state representations; this makes the absence of compounding visual/mechanical drift over arbitrary lengths an empirical observation rather than a bounded property, requiring additional verification beyond the tested horizon.

Authors: This observation is accurate: stability is demonstrated empirically up to the 5-minute horizon rather than as a theoretically bounded property. We will update the architecture and limitations sections to explicitly characterize the result as empirical, discuss the reliance on recent-frame conditioning plus augmentations, and note that longer horizons may require additional mechanisms such as explicit memory. revision: yes

Circularity Check

0 steps flagged

No circularity: next-frame diffusion trained independently on external RL recordings

full rationale

The paper describes a two-stage pipeline in which an RL agent first generates gameplay trajectories that are recorded as external data, after which a diffusion model is trained on the separate task of next-frame prediction conditioned on past frames and actions. Conditioning augmentations and decoder fine-tuning are training heuristics whose effect on long-horizon stability is assessed empirically via auto-regressive rollouts; these steps do not reduce to the input data or to any fitted parameter by construction. No equations, uniqueness theorems, or self-citations are invoked that would make the reported PSNR, FPS, or human-rater results tautological with the training procedure.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical success of conditional diffusion for next-frame prediction plus the unstated assumption that the collected DOOM trajectories sufficiently cover the state space needed for stable long-horizon generation. No new physical entities or mathematical axioms are introduced.

free parameters (2)

conditioning augmentation strength
Hyperparameters controlling how past frames and actions are augmented during training to promote stability; their specific values are chosen to achieve the reported long-trajectory performance.
decoder fine-tuning learning rate and steps
Parameters used in the second training phase to improve visual fidelity; these are fitted to the target game data.

axioms (1)

domain assumption The distribution of next frames given recent history and actions is learnable from finite gameplay recordings
Invoked implicitly when the diffusion model is trained to approximate the conditional distribution for auto-regressive rollout.

pith-pipeline@v0.9.0 · 5500 in / 1339 out tokens · 37162 ms · 2026-05-16T12:01:01.639050+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WorldMark: A Unified Benchmark Suite for Interactive Video World Models
cs.CV 2026-04 unverdicted novelty 7.0

WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
cs.CV 2026-04 unverdicted novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
Telecom World Models: Unifying Digital Twins, Foundation Models, and Predictive Planning for 6G
cs.RO 2026-04 unverdicted novelty 7.0

Telecom World Models introduce a three-layer architecture for learned, action-conditioned, uncertainty-aware modeling of 6G network dynamics, combining digital twins and foundation models, with a network slicing proof...
SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
cs.CV 2026-04 unverdicted novelty 7.0

SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
Training Agents Inside of Scalable World Models
cs.AI 2025-09 conditional novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
cs.CV 2026-05 unverdicted novelty 6.0

M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows
cs.LG 2026-03 unverdicted novelty 6.0

WinDiNet repurposes a 2B-parameter video diffusion model as a differentiable surrogate that generates 112-frame urban wind flow rollouts in under one second and enables direct gradient optimization of building positions.
DynaWeb: Model-Based Reinforcement Learning of Web Agents
cs.CL 2026-01 unverdicted novelty 6.0

DynaWeb introduces a model-based RL framework that trains web agents via imagined rollouts in a learned web world model interleaved with real expert trajectories, yielding consistent gains on WebArena and WebVoyager b...
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
cs.CV 2025-09 unverdicted novelty 6.0

Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.
Test-Time Training Done Right
cs.LG 2025-05 conditional novelty 6.0

Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
cs.CV 2025-03 unverdicted novelty 6.0

FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.
Unified Video Action Model
cs.RO 2025-02 unverdicted novelty 6.0

UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
cs.CV 2026-05 unverdicted novelty 5.0

SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 5.0

The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Advancing Open-source World Models
cs.CV 2026-01 unverdicted novelty 4.0

LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 3.0

This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
Cosmos World Foundation Model Platform for Physical AI
cs.CV 2025-01 unverdicted novelty 3.0

The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 19 Pith papers · 15 internal anchors

[1]

Advances in Neural Information Processing Systems , volume=

Photorealistic text-to-image diffusion models with deep language understanding , author=. Advances in Neural Information Processing Systems , volume=

work page
[2]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[3]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Scaling autoregressive models for content-rich text-to-image generation , author=. arXiv preprint arXiv:2206.10789 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

The Tenth International Conference on Learning Representations,

Tim Salimans and Jonathan Ho , title =. The Tenth International Conference on Learning Representations,. 2022 , url =

work page 2022
[8]

arXiv preprint arXiv:2403.12015 , year=

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation , author=. arXiv preprint arXiv:2403.12015 , year=

work page arXiv
[9]

CVPR , year=

One-step Diffusion with Distribution Matching Distillation , author=. CVPR , year=

work page
[10]

2022 , eprint=

Classifier-Free Diffusion Guidance , author=. 2022 , eprint=

work page 2022
[11]

ArXiv , year=

Imagen Video: High Definition Video Generation with Diffusion Models , author=. ArXiv , year=

work page
[12]

2023 , eprint=

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models , author=. 2023 , eprint=

work page 2023
[13]

2023 , eprint=

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets , author=. 2023 , eprint=

work page 2023
[14]

2023 , eprint=

Photorealistic Video Generation with Diffusion Models , author=. 2023 , eprint=

work page 2023
[15]

2023 , eprint=

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning , author=. 2023 , eprint=

work page 2023
[16]

2024 , eprint=

Lumiere: A Space-Time Diffusion Model for Video Generation , author=. 2024 , eprint=

work page 2024
[17]

Auto-Encoding Variational Bayes

Kingma, Diederik P. and Welling, Max , biburl =. 2nd International Conference on Learning Representations,. http://arxiv.org/abs/1312.6114v10 , eprintclass =

work page internal anchor Pith review Pith/arXiv arXiv
[18]

2024 , eprint=

DINOv2: Learning Robust Visual Features without Supervision , author=. 2024 , eprint=

work page 2024
[22]

CVPR , year=

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , author=. CVPR , year=

work page
[23]

CoRR , volume =

John Schulman and Filip Wolski and Prafulla Dhariwal and Alec Radford and Oleg Klimov , title =. CoRR , volume =. 2017 , url =

work page 2017
[24]

Nature , year=

Human-level control through deep reinforcement learning , author=. Nature , year=

work page
[25]

Deep Generative Models for Highly Structured Data,

Thomas Unterthiner and Sjoerd van Steenkiste and Karol Kurach and Rapha. Deep Generative Models for Highly Structured Data,. 2019 , timestamp =

work page 2019
[27]

2020 , booktitle =

Seung Wook Kim and Yuhao Zhou and Jonah Philion and Antonio Torralba and Sanja Fidler , title =. 2020 , booktitle =

work page 2020
[28]

2024 , eprint=

Diffusion for World Modeling: Visual Details Matter in Atari , author=. 2024 , eprint=

work page 2024
[29]

2024 , eprint=

Genie: Generative Interactive Environments , author=. 2024 , eprint=

work page 2024
[30]

2020 , eprint=

Dream to Control: Learning Behaviors by Latent Imagination , author=. 2020 , eprint=

work page 2020
[31]

2018 , isbn =

Akenine-Mller, Tomas and Haines, Eric and Hoffman, Naty , title =. 2018 , isbn =

work page 2018
[32]

2008 , publisher=

Realistic Ray Tracing, Second Edition , author=. 2008 , publisher=

work page 2008
[33]

2020 , booktitle=

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis , author=. 2020 , booktitle=

work page 2020
[34]

3D Gaussian Splatting for Real-Time Radiance Field Rendering , journal =

Kerbl, Bernhard and Kopanas, Georgios and Leimk. 3D Gaussian Splatting for Real-Time Radiance Field Rendering , journal =. 2023 , url =

work page 2023
[35]

2018 , Eprint =

Edoardo Giacomello and Pier Luca Lanzi and Daniele Loiacono , Title =. 2018 , Eprint =

work page 2018
[36]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Adding conditional control to text-to-image diffusion models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[37]

2018 , Eprint =

David Ha and Jürgen Schmidhuber , Title=. 2018 , Eprint =

work page 2018
[38]

2024 , Eprint =

Jake Bruce and Michael Dennis and Ashley Edwards and Jack Parker-Holder and Yuge Shi and Edward Hughes and Matthew Lai and Aditi Mavalankar and Richie Steigerwald and Chris Apps and Yusuf Aytar and Sarah Bechtle and Feryal Behbahani and Stephanie Chan and Nicolas Heess and Lucy Gonzalez and Simon Osindero and Sherjil Ozair and Scott Reed and Jingwei Zhang...

work page 2024
[39]

2023 , Eprint =

Anthony Hu and Lloyd Russell and Hudson Yeo and Zak Murez and George Fedoseev and Alex Kendall and Jamie Shotton and Gianluca Corrado , Title =. 2023 , Eprint =

work page 2023
[40]

2023 , Eprint =

Lucy Chai and Richard Tucker and Zhengqi Li and Phillip Isola and Noah Snavely , Title =. 2023 , Eprint =

work page 2023
[41]

2024 , Eprint =

Zihan Ding and Amy Zhang and Yuandong Tian and Qinqing Zheng , Title =. 2024 , Eprint =

work page 2024
[42]

2018 , eprint=

Comparison between CS and JPEG in terms of image compression , author=. 2018 , eprint=

work page 2018
[43]

2022 , eprint=

Denoising Diffusion Implicit Models , author=. 2022 , eprint=

work page 2022
[44]

Journal of Machine Learning Research , year =

Antonin Raffin and Ashley Hill and Adam Gleave and Anssi Kanervisto and Maximilian Ernestus and Noah Dormann , title =. Journal of Machine Learning Research , year =

work page
[46]

2022 , eprint=

Progressive Distillation for Fast Sampling of Diffusion Models , author=. 2022 , eprint=

work page 2022
[48]

2021 , eprint=

Playable Video Generation , author=. 2021 , eprint=

work page 2021
[49]

2024 , url=

Video generation models as world simulators , author=. 2024 , url=

work page 2024
[50]

ArXiv , year=

GAIA-1: A Generative World Model for Autonomous Driving , author=. ArXiv , year=

work page
[51]

2024 , eprint=

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion , author=. 2024 , eprint=

work page 2024
[52]

Proceedings of the 41st International Conference on Machine Learning , pages =

Rolling Diffusion Models , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

work page 2024
[53]

2021 , eprint=

VideoGPT: Video Generation using VQ-VAE and Transformers , author=. 2021 , eprint=

work page 2021
[54]

nature , volume=

Human-level control through deep reinforcement learning , author=. nature , volume=. 2015 , publisher=

work page 2015
[55]

Real-Time Rendering, Fourth Edition

Tomas Akenine-Mller, Eric Haines, and Naty Hoffman. Real-Time Rendering, Fourth Edition. A. K. Peters, Ltd., USA, 4th edition, 2018. ISBN 0134997832

work page 2018
[56]

Diffusion for world modeling: Visual details matter in atari, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari, 2024

work page 2024
[57]

Lumiere: A space-time diffusion model for video generation, 2024

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A space-time diffusion model for video generation, 2024. URL https://arxiv.org/abs/2401.12945

work page arXiv 2024
[58]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023 a . URL https://arxiv.org/abs/2311.15127

work page internal anchor Pith review Pith/arXiv arXiv 2023
[59]

Align your latents: High-resolution video synthesis with latent diffusion models, 2023 b

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models, 2023 b . URL https://arxiv.org/abs/2304.08818

work page arXiv 2023
[60]

Video generation models as world simulators, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators

work page 2024
[61]

Genie: Generative interactive environments, 2024

Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Si...

work page arXiv 2024
[62]

Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024. URL https://arxiv.org/abs/2407.01392

work page arXiv 2024
[63]

Emu video: Factorizing text-to-video generation by explicit image conditioning, 2023

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning, 2023. URL https://arxiv.org/abs/2311.10709

work page arXiv 2023
[64]

Photorealistic video generation with diffusion models, 2023

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models, 2023. URL https://arxiv.org/abs/2312.06662

work page arXiv 2023
[65]

World models, 2018

David Ha and Jürgen Schmidhuber. World models, 2018

work page 2018
[66]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination, 2020. URL https://arxiv.org/abs/1912.01603

work page internal anchor Pith review Pith/arXiv arXiv 2020
[67]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/abs/2207.12598

work page internal anchor Pith review Pith/arXiv arXiv 2022
[68]

Cascaded diffusion models for high fidelity image generation

Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021

work page arXiv 2021
[69]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. ArXiv, abs/2210.02303, 2022. URL https://api.semanticscholar.org/CorpusID:252715883

work page internal anchor Pith review Pith/arXiv arXiv 2022
[70]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. ArXiv, abs/2309.17080, 2023. URL https://api.semanticscholar.org/CorpusID:263310665

work page internal anchor Pith review Pith/arXiv arXiv 2023
[71]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk \"u hler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42 0 (4), July 2023. URL https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

work page 2023
[72]

Learning to Simulate Dynamic Environments with GameGAN

Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. Learning to Simulate Dynamic Environments with GameGAN . In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020

work page 2020
[73]

Kingma and Max Welling

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes . In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings , 2014

work page 2014
[74]

Playable video generation, 2021

Willi Menapace, Stéphane Lathuilière, Sergey Tulyakov, Aliaksandr Siarohin, and Elisa Ricci. Playable video generation, 2021. URL https://arxiv.org/abs/2101.12195

work page arXiv 2021
[75]

Promptable game models: Text-guided game simulation via masked diffusion models

Willi Menapace, Aliaksandr Siarohin, Stéphane Lathuilière, Panos Achlioptas, Vladislav Golyanik, Sergey Tulyakov, and Elisa Ricci. Promptable game models: Text-guided game simulation via masked diffusion models. ACM Transactions on Graphics, 43 0 (2): 0 1–16, January 2024. ISSN 1557-7368. doi:10.1145/3635705. URL http://dx.doi.org/10.1145/3635705

work page doi:10.1145/3635705 2024
[76]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

work page 2020
[77]

Human-level control through deep reinforcement learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518 0 (7540): 0 529--533, 2015 a

work page 2015
[78]

Rusu, Joel Veness, Marc G

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Kirkeby Fidjeland, Georg Ostrovski, Stig Petersen, Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinfor...

work page 2015
[79]

Comparison between CS and JPEG in terms of image compression

Danko Petric and Marija Milinkovic. Comparison between cs and jpeg in terms of image compression, 2018. URL https://arxiv.org/abs/1802.05114

work page internal anchor Pith review Pith/arXiv arXiv 2018
[80]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M \"u ller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[81]

Stable-baselines3: Reliable reinforcement learning implementations

Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22 0 (268): 0 1--8, 2021. URL http://jmlr.org/papers/v22/20-1364.html

work page 2021
[82]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[83]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022

work page 2022
[84]

Rolling diffusion models

David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 42818...

work page 2024
[85]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 0 36479--36494, 2022

work page 2022
[86]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models, 2022. URL https://arxiv.org/abs/2202.00512

work page internal anchor Pith review Pith/arXiv arXiv 2022
[87]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[88]

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235, 2018. URL http://arxiv.org/abs/1804.04235

work page internal anchor Pith review Pith/arXiv arXiv 2018
[89]

Shirley and R.K

P. Shirley and R.K. Morley. Realistic Ray Tracing, Second Edition. Taylor & Francis, 2008. ISBN 9781568814612. URL https://books.google.ch/books?id=knpN6mnhJ8QC

work page 2008

Showing first 80 references.