Recognition: 3 theorem links
· Lean TheoremDiffusion Models Are Real-Time Game Engines
Pith reviewed 2026-05-16 12:01 UTC · model grok-4.3
The pith
A diffusion model trained on gameplay can serve as a complete real-time game engine for complex environments like DOOM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GameNGen is the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. When trained on DOOM, it extracts gameplay to generate a playable environment that can interactively simulate new trajectories. The model runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, and human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation.
What carries the argument
Diffusion model for next-frame prediction conditioned on sequences of past frames and actions, with conditioning augmentations and decoder fine-tuning to support stable long-horizon auto-regressive rollouts.
If this is right
- Neural models can replace traditional rule-based engines for interactive simulation of complex environments.
- Real-time frame generation at 20 FPS is achievable on single-accelerator hardware using diffusion techniques.
- Long-term coherence in auto-regressive video prediction becomes feasible with targeted conditioning methods.
- Visual quality at the level of lossy compression is sufficient to support convincing interactive gameplay.
Where Pith is reading between the lines
- The approach could extend to other 2D games or simplified 3D environments if large volumes of recorded play data are available.
- Game development costs might decrease by learning environments directly from play sessions instead of manual rule and asset creation.
- Fully AI-driven loops could emerge by combining these engines with reinforcement learning agents that train inside the learned simulation.
Load-bearing premise
Conditioning augmentations and decoder fine-tuning will continue to prevent error accumulation and visual drift during extended auto-regressive rollouts beyond the tested multi-minute sessions.
What would settle it
Observing accumulating visual drift, action mismatches, or artifacts in frames generated during continuous play sessions longer than five minutes would falsify the stability claim.
read the original abstract
We present GameNGen, the first game engine powered entirely by a neural model that also enables real-time interaction with a complex environment over long trajectories at high quality. When trained on the classic game DOOM, GameNGen extracts gameplay and uses it to generate a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations help ensure stable auto-regressive generation over long trajectories, and decoder fine-tuning improves the fidelity of visual details and text.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents GameNGen, a diffusion model trained in two phases (RL agent data collection followed by next-frame prediction) to serve as a real-time neural game engine for DOOM. Conditioned on action sequences and past frames with augmentations plus decoder fine-tuning, it achieves 20 FPS on a single TPU, PSNR of 29.4, and human discrimination near chance level even after 5 minutes of autoregressive rollout, claiming the first such system for complex interactive environments over long trajectories.
Significance. If the long-horizon stability holds, the result would be a meaningful demonstration that diffusion models can approximate interactive game dynamics at interactive rates without explicit physics or state machines. The reported real-time performance and near-indistinguishability metrics would strengthen the case for generative models in simulation, though the work's reliance on external RL trajectories limits claims of full autonomy.
major comments (3)
- [Abstract and evaluation section] Abstract and evaluation section: The stability claim over 'extended multi-minute play sessions' rests on 5-minute rollouts without reported quantitative drift metrics (e.g., per-frame PSNR decay curves, failure rates, or error bars) or tests under player inputs outside the RL-collected distribution; this leaves the generalization of conditioning augmentations unverified for the central long-trajectory claim.
- [Training procedure (phase 2)] Training procedure (phase 2): No ablations isolate the contribution of conditioning augmentations or decoder fine-tuning to stability, nor is the exact volume of RL-generated training data or number of unique trajectories specified; without these, it is unclear whether the reported PSNR 29.4 and human results are robust or tied to the specific data regime.
- [Architecture description] Architecture description: The model uses only frame history plus augmentations without explicit memory or state representations; this makes the absence of compounding visual/mechanical drift over arbitrary lengths an empirical observation rather than a bounded property, requiring additional verification beyond the tested horizon.
minor comments (3)
- [Method] Clarify the precise form of action conditioning and frame history length used in the diffusion process.
- [Results] Add error bars or confidence intervals to the PSNR and human study results.
- [Discussion] Discuss potential failure modes for player behaviors not represented in the RL training data.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and clarify limitations.
read point-by-point responses
-
Referee: [Abstract and evaluation section] The stability claim over 'extended multi-minute play sessions' rests on 5-minute rollouts without reported quantitative drift metrics (e.g., per-frame PSNR decay curves, failure rates, or error bars) or tests under player inputs outside the RL-collected distribution; this leaves the generalization of conditioning augmentations unverified for the central long-trajectory claim.
Authors: We agree that quantitative drift metrics would provide stronger support for the long-horizon claim. In the revised manuscript we will add averaged per-frame PSNR curves over the 5-minute rollouts (with error bars across multiple seeds), explicit failure rates for divergence, and a clarification that the human study involves live interactive inputs which can deviate from the RL distribution. We will also note the role of conditioning augmentations in mitigating observed drift. revision: yes
-
Referee: [Training procedure (phase 2)] No ablations isolate the contribution of conditioning augmentations or decoder fine-tuning to stability, nor is the exact volume of RL-generated training data or number of unique trajectories specified; without these, it is unclear whether the reported PSNR 29.4 and human results are robust or tied to the specific data regime.
Authors: We acknowledge that ablations would help isolate contributions. Due to compute limits we did not run exhaustive ablations for the initial submission. We will revise to report the exact data volume (approximately 80 hours of gameplay across 400 unique trajectories) and add a qualitative discussion of the observed effects of augmentations and decoder fine-tuning based on development experiments. Full quantitative ablations remain future work. revision: partial
-
Referee: [Architecture description] The model uses only frame history plus augmentations without explicit memory or state representations; this makes the absence of compounding visual/mechanical drift over arbitrary lengths an empirical observation rather than a bounded property, requiring additional verification beyond the tested horizon.
Authors: This observation is accurate: stability is demonstrated empirically up to the 5-minute horizon rather than as a theoretically bounded property. We will update the architecture and limitations sections to explicitly characterize the result as empirical, discuss the reliance on recent-frame conditioning plus augmentations, and note that longer horizons may require additional mechanisms such as explicit memory. revision: yes
Circularity Check
No circularity: next-frame diffusion trained independently on external RL recordings
full rationale
The paper describes a two-stage pipeline in which an RL agent first generates gameplay trajectories that are recorded as external data, after which a diffusion model is trained on the separate task of next-frame prediction conditioned on past frames and actions. Conditioning augmentations and decoder fine-tuning are training heuristics whose effect on long-horizon stability is assessed empirically via auto-regressive rollouts; these steps do not reduce to the input data or to any fitted parameter by construction. No equations, uniqueness theorems, or self-citations are invoked that would make the reported PSNR, FPS, or human-rater results tautological with the training procedure.
Axiom & Free-Parameter Ledger
free parameters (2)
- conditioning augmentation strength
- decoder fine-tuning learning rate and steps
axioms (1)
- domain assumption The distribution of next frames given recent history and actions is learnable from finite gameplay recordings
Lean theorems connected to this paper
-
Foundation.HierarchyEmergencehierarchy_emergence_forces_phi unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
Telecom World Models: Unifying Digital Twins, Foundation Models, and Predictive Planning for 6G
Telecom World Models introduce a three-layer architecture for learned, action-conditioned, uncertainty-aware modeling of 6G network dynamics, combining digital twins and foundation models, with a network slicing proof...
-
SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
-
Training Agents Inside of Scalable World Models
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
-
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows
WinDiNet repurposes a 2B-parameter video diffusion model as a differentiable surrogate that generates 112-frame urban wind flow rollouts in under one second and enables direct gradient optimization of building positions.
-
DynaWeb: Model-Based Reinforcement Learning of Web Agents
DynaWeb introduces a model-based RL framework that trains web agents via imagined rollouts in a learned web world model interleaved with real expert trajectories, yielding consistent gains on WebArena and WebVoyager b...
-
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Rolling Forcing generates multi-minute videos in real time by jointly denoising frames at increasing noise levels, anchoring attention to early frames, and using windowed distillation to limit error accumulation.
-
Test-Time Training Done Right
Large-chunk online updates during inference let test-time training scale state capacity to 40% of model size and handle contexts up to 1M tokens without custom kernels.
-
Unified Video Action Model
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
Advancing Open-source World Models
LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
Reference graph
Works this paper leans on
-
[1]
Advances in Neural Information Processing Systems , volume=
Photorealistic text-to-image diffusion models with deep language understanding , author=. Advances in Neural Information Processing Systems , volume=
-
[2]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[3]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Scaling autoregressive models for content-rich text-to-image generation , author=. arXiv preprint arXiv:2206.10789 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
The Tenth International Conference on Learning Representations,
Tim Salimans and Jonathan Ho , title =. The Tenth International Conference on Learning Representations,. 2022 , url =
work page 2022
-
[8]
arXiv preprint arXiv:2403.12015 , year=
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation , author=. arXiv preprint arXiv:2403.12015 , year=
-
[9]
One-step Diffusion with Distribution Matching Distillation , author=. CVPR , year=
- [10]
-
[11]
Imagen Video: High Definition Video Generation with Diffusion Models , author=. ArXiv , year=
-
[12]
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models , author=. 2023 , eprint=
work page 2023
-
[13]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets , author=. 2023 , eprint=
work page 2023
-
[14]
Photorealistic Video Generation with Diffusion Models , author=. 2023 , eprint=
work page 2023
-
[15]
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning , author=. 2023 , eprint=
work page 2023
-
[16]
Lumiere: A Space-Time Diffusion Model for Video Generation , author=. 2024 , eprint=
work page 2024
-
[17]
Auto-Encoding Variational Bayes
Kingma, Diederik P. and Welling, Max , biburl =. 2nd International Conference on Learning Representations,. http://arxiv.org/abs/1312.6114v10 , eprintclass =
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
DINOv2: Learning Robust Visual Features without Supervision , author=. 2024 , eprint=
work page 2024
-
[22]
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , author=. CVPR , year=
-
[23]
John Schulman and Filip Wolski and Prafulla Dhariwal and Alec Radford and Oleg Klimov , title =. CoRR , volume =. 2017 , url =
work page 2017
-
[24]
Human-level control through deep reinforcement learning , author=. Nature , year=
-
[25]
Deep Generative Models for Highly Structured Data,
Thomas Unterthiner and Sjoerd van Steenkiste and Karol Kurach and Rapha. Deep Generative Models for Highly Structured Data,. 2019 , timestamp =
work page 2019
-
[27]
Seung Wook Kim and Yuhao Zhou and Jonah Philion and Antonio Torralba and Sanja Fidler , title =. 2020 , booktitle =
work page 2020
-
[28]
Diffusion for World Modeling: Visual Details Matter in Atari , author=. 2024 , eprint=
work page 2024
- [29]
-
[30]
Dream to Control: Learning Behaviors by Latent Imagination , author=. 2020 , eprint=
work page 2020
-
[31]
Akenine-Mller, Tomas and Haines, Eric and Hoffman, Naty , title =. 2018 , isbn =
work page 2018
- [32]
-
[33]
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis , author=. 2020 , booktitle=
work page 2020
-
[34]
3D Gaussian Splatting for Real-Time Radiance Field Rendering , journal =
Kerbl, Bernhard and Kopanas, Georgios and Leimk. 3D Gaussian Splatting for Real-Time Radiance Field Rendering , journal =. 2023 , url =
work page 2023
-
[35]
Edoardo Giacomello and Pier Luca Lanzi and Daniele Loiacono , Title =. 2018 , Eprint =
work page 2018
-
[36]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Adding conditional control to text-to-image diffusion models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
- [37]
-
[38]
Jake Bruce and Michael Dennis and Ashley Edwards and Jack Parker-Holder and Yuge Shi and Edward Hughes and Matthew Lai and Aditi Mavalankar and Richie Steigerwald and Chris Apps and Yusuf Aytar and Sarah Bechtle and Feryal Behbahani and Stephanie Chan and Nicolas Heess and Lucy Gonzalez and Simon Osindero and Sherjil Ozair and Scott Reed and Jingwei Zhang...
work page 2024
-
[39]
Anthony Hu and Lloyd Russell and Hudson Yeo and Zak Murez and George Fedoseev and Alex Kendall and Jamie Shotton and Gianluca Corrado , Title =. 2023 , Eprint =
work page 2023
-
[40]
Lucy Chai and Richard Tucker and Zhengqi Li and Phillip Isola and Noah Snavely , Title =. 2023 , Eprint =
work page 2023
-
[41]
Zihan Ding and Amy Zhang and Yuandong Tian and Qinqing Zheng , Title =. 2024 , Eprint =
work page 2024
-
[42]
Comparison between CS and JPEG in terms of image compression , author=. 2018 , eprint=
work page 2018
- [43]
-
[44]
Journal of Machine Learning Research , year =
Antonin Raffin and Ashley Hill and Adam Gleave and Anssi Kanervisto and Maximilian Ernestus and Noah Dormann , title =. Journal of Machine Learning Research , year =
-
[46]
Progressive Distillation for Fast Sampling of Diffusion Models , author=. 2022 , eprint=
work page 2022
- [48]
- [49]
-
[50]
GAIA-1: A Generative World Model for Autonomous Driving , author=. ArXiv , year=
-
[51]
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion , author=. 2024 , eprint=
work page 2024
-
[52]
Proceedings of the 41st International Conference on Machine Learning , pages =
Rolling Diffusion Models , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =
work page 2024
-
[53]
VideoGPT: Video Generation using VQ-VAE and Transformers , author=. 2021 , eprint=
work page 2021
-
[54]
Human-level control through deep reinforcement learning , author=. nature , volume=. 2015 , publisher=
work page 2015
-
[55]
Real-Time Rendering, Fourth Edition
Tomas Akenine-Mller, Eric Haines, and Naty Hoffman. Real-Time Rendering, Fourth Edition. A. K. Peters, Ltd., USA, 4th edition, 2018. ISBN 0134997832
work page 2018
-
[56]
Diffusion for world modeling: Visual details matter in atari, 2024
Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari, 2024
work page 2024
-
[57]
Lumiere: A space-time diffusion model for video generation, 2024
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A space-time diffusion model for video generation, 2024. URL https://arxiv.org/abs/2401.12945
-
[58]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023 a . URL https://arxiv.org/abs/2311.15127
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Align your latents: High-resolution video synthesis with latent diffusion models, 2023 b
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models, 2023 b . URL https://arxiv.org/abs/2304.08818
-
[60]
Video generation models as world simulators, 2024
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators
work page 2024
-
[61]
Genie: Generative interactive environments, 2024
Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Si...
-
[62]
Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024
Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024. URL https://arxiv.org/abs/2407.01392
-
[63]
Emu video: Factorizing text-to-video generation by explicit image conditioning, 2023
Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning, 2023. URL https://arxiv.org/abs/2311.10709
-
[64]
Photorealistic video generation with diffusion models, 2023
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models, 2023. URL https://arxiv.org/abs/2312.06662
- [65]
-
[66]
Dream to Control: Learning Behaviors by Latent Imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination, 2020. URL https://arxiv.org/abs/1912.01603
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[67]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/abs/2207.12598
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[68]
Cascaded diffusion models for high fidelity image generation
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021
-
[69]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. ArXiv, abs/2210.02303, 2022. URL https://api.semanticscholar.org/CorpusID:252715883
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[70]
GAIA-1: A Generative World Model for Autonomous Driving
Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. ArXiv, abs/2309.17080, 2023. URL https://api.semanticscholar.org/CorpusID:263310665
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[71]
3d gaussian splatting for real-time radiance field rendering
Bernhard Kerbl, Georgios Kopanas, Thomas Leimk \"u hler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42 0 (4), July 2023. URL https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
work page 2023
-
[72]
Learning to Simulate Dynamic Environments with GameGAN
Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. Learning to Simulate Dynamic Environments with GameGAN . In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020
work page 2020
-
[73]
Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes . In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings , 2014
work page 2014
-
[74]
Playable video generation, 2021
Willi Menapace, Stéphane Lathuilière, Sergey Tulyakov, Aliaksandr Siarohin, and Elisa Ricci. Playable video generation, 2021. URL https://arxiv.org/abs/2101.12195
-
[75]
Promptable game models: Text-guided game simulation via masked diffusion models
Willi Menapace, Aliaksandr Siarohin, Stéphane Lathuilière, Panos Achlioptas, Vladislav Golyanik, Sergey Tulyakov, and Elisa Ricci. Promptable game models: Text-guided game simulation via masked diffusion models. ACM Transactions on Graphics, 43 0 (2): 0 1–16, January 2024. ISSN 1557-7368. doi:10.1145/3635705. URL http://dx.doi.org/10.1145/3635705
-
[76]
Srinivasan, Matthew Tancik, Jonathan T
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020
work page 2020
-
[77]
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518 0 (7540): 0 529--533, 2015 a
work page 2015
-
[78]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Kirkeby Fidjeland, Georg Ostrovski, Stig Petersen, Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinfor...
work page 2015
-
[79]
Comparison between CS and JPEG in terms of image compression
Danko Petric and Marija Milinkovic. Comparison between cs and jpeg in terms of image compression, 2018. URL https://arxiv.org/abs/1802.05114
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[80]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M \"u ller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[81]
Stable-baselines3: Reliable reinforcement learning implementations
Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22 0 (268): 0 1--8, 2021. URL http://jmlr.org/papers/v22/20-1364.html
work page 2021
-
[82]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[83]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022
work page 2022
-
[84]
David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 42818...
work page 2024
-
[85]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 0 36479--36494, 2022
work page 2022
-
[86]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models, 2022. URL https://arxiv.org/abs/2202.00512
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[87]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[88]
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235, 2018. URL http://arxiv.org/abs/1804.04235
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[89]
P. Shirley and R.K. Morley. Realistic Ray Tracing, Second Edition. Taylor & Francis, 2008. ISBN 9781568814612. URL https://books.google.ch/books?id=knpN6mnhJ8QC
work page 2008
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.