pith. machine review for the scientific record.

arxiv: 2512.14614 · v1 · submitted 2025-12-16 · 💻 cs.CV · cs.GR

Recognition: 3 theorem links


WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 14:25 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords: video diffusion · world modeling · geometric consistency · streaming generation · context memory · real-time video · interactive simulation · action conditioning

The pith

WorldPlay generates long-horizon 720p video at 24 FPS while preserving geometric consistency through rebuilt context memory in a streaming diffusion model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WorldPlay to resolve the speed versus memory trade-off that has limited interactive world modeling. It combines a dual action representation for keyboard and mouse control with two techniques that keep distant past frames available: a memory module that reconstitutes context on the fly and a distillation step that forces alignment between a full-context teacher and a fast student. A sympathetic reader would care because this setup produces interactive scenes that stay coherent over many seconds instead of drifting into geometric nonsense. If the approach holds, it removes a major barrier to real-time simulation of 3D environments directly from video.

Core claim

WorldPlay is a streaming video diffusion model that produces real-time interactive world models with long-term geometric consistency. It rests on three components: a dual action representation that converts user keyboard and mouse inputs into robust control signals, a Reconstituted Context Memory that dynamically rebuilds past frames and applies temporal reframing to retain geometrically critical information, and Context Forcing, a distillation process that aligns memory usage between teacher and student so the faster model does not lose long-range awareness. Together these allow generation of 720p video at 24 frames per second across long horizons while reducing error accumulation.
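The abstract names these components without equations or pseudocode, so the sketch below is only a hedged illustration of how a reconstituted context memory with temporal reframing might work: keep a few recent frames, fill the remaining budget with the most geometrically important long-past frames, and remap their temporal indices into a compact range so they stay inside the model's context window. The class name, the capacity and recency parameters, and the geometry-importance score are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ReconstitutedContextMemory:
    """Illustrative sketch, not the paper's implementation: a fixed-size
    context mixing recent frames with geometrically important long-past
    frames, with temporal indices reframed for a streaming model."""
    capacity: int = 16        # assumed total context budget, in frames
    recent_window: int = 8    # assumed number of always-kept recent frames
    frames: list = field(default_factory=list)  # entries: (index, latent, score)

    def add(self, index, latent, geometry_score):
        # geometry_score is a stand-in for whatever criterion marks a frame
        # as geometrically important (e.g. viewpoint novelty).
        self.frames.append((index, latent, geometry_score))

    def rebuild(self, current_index):
        # Always keep the most recent frames.
        recent = [f for f in self.frames if current_index - f[0] < self.recent_window]
        older = [f for f in self.frames if current_index - f[0] >= self.recent_window]
        # Fill the remaining slots with the highest-scoring long-past frames.
        older.sort(key=lambda f: f[2], reverse=True)
        keep = recent + older[: max(0, self.capacity - len(recent))]
        keep.sort(key=lambda f: f[0])
        # "Temporal reframing" here means remapping original frame indices
        # onto a compact 0..len(keep)-1 axis so distant frames remain
        # addressable inside the context window.
        reframed_positions = {f[0]: i for i, f in enumerate(keep)}
        context = np.stack([f[1] for f in keep])
        return context, reframed_positions
```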

What carries the argument

Reconstituted Context Memory, which dynamically rebuilds and reframes past frames to keep long-range geometric details accessible, paired with Context Forcing distillation that preserves the student's ability to use that memory at real-time speed.
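Context Forcing is described only as aligning memory context between teacher and student. The following is a minimal sketch, under that reading, of one distillation step in which both models consume the same reconstituted memory so the student cannot learn to ignore long-range information. The plain output-matching MSE loss and the model call signatures are assumptions; the paper likely uses a distribution-matching style objective, which this stand-in does not reproduce.

```python
import torch
import torch.nn.functional as F

def context_forcing_step(teacher, student, noisy_latents, timestep, memory_context, optimizer):
    """Hedged sketch of a memory-aligned distillation step: teacher and
    student see the identical reconstituted context, so matching the
    teacher forces the student to keep using long-range memory."""
    with torch.no_grad():
        teacher_pred = teacher(noisy_latents, timestep, context=memory_context)
    student_pred = student(noisy_latents, timestep, context=memory_context)
    loss = F.mse_loss(student_pred, teacher_pred)  # stand-in objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```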

If this is right

  • The model produces 720p streaming video at 24 FPS with better geometric consistency than prior methods.
  • It maintains coherence across long interaction horizons without visible drift.
  • It generalizes to a wide range of scenes without retraining.
  • It achieves real-time inference while still using information from frames that would otherwise be forgotten.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the memory alignment technique works, similar distillation could be applied to other streaming generative models that currently suffer from state forgetting.
  • The same context-rebuilding pattern might allow interactive 3D reconstruction pipelines to operate without maintaining full voxel or mesh histories.
  • Success here would suggest that explicit temporal reframing can substitute for ever-larger context windows in video models.

Load-bearing premise

The memory-rebuilding and distillation steps actually retain accurate long-range geometry across hundreds of frames without introducing new distortions or needing prohibitive storage.

What would settle it

Side-by-side measurement of object positions and surface normals in generated 720p sequences versus ground-truth geometry after 500 or more frames; any systematic increase in positional error beyond a few pixels would falsify the consistency claim.
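As a concrete form of that test, the sketch below computes per-frame positional error from tracked point locations in a generated sequence against ground-truth projections. How the tracked points are obtained (for example, an off-the-shelf tracker or pose estimator run on the generated 720p frames) is an assumption, since the abstract specifies no evaluation protocol.

```python
import numpy as np

def positional_drift(gt_points_px, gen_points_px):
    """Hedged sketch of the falsification test described above.

    gt_points_px, gen_points_px: arrays of shape (num_frames, num_points, 2)
    holding pixel locations of the same tracked points in the ground-truth
    and generated sequences. Establishing these correspondences is assumed
    to be handled by an external tracker; the paper does not specify one.
    """
    errors = np.linalg.norm(gen_points_px - gt_points_px, axis=-1)  # (frames, points)
    per_frame = errors.mean(axis=1)
    # The consistency claim would be falsified by a systematic rise, e.g.
    # mean error past frame 500 settling above a few pixels.
    late_error = per_frame[500:].mean() if per_frame.shape[0] > 500 else float("nan")
    return per_frame, late_error
```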

read the original abstract

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware model. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found: https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This paper presents WorldPlay, a streaming video diffusion model for real-time interactive world modeling that achieves long-term geometric consistency. It introduces three innovations: a Dual Action Representation for handling user keyboard/mouse inputs, Reconstituted Context Memory that dynamically rebuilds context from past frames with temporal reframing to maintain access to geometrically important long-past frames, and Context Forcing, a distillation technique that aligns memory context between teacher and student models to preserve long-range information while enabling real-time inference. The method claims to generate long-horizon 720p video at 24 FPS with superior consistency to existing techniques and strong generalization across diverse scenes.

Significance. If the central claims are substantiated, this would represent a meaningful advance in interactive world modeling by resolving the speed-memory trade-off in streaming video diffusion models. The approach of combining dynamic memory reconstitution with teacher-student alignment for drift prevention could enable practical real-time applications in simulation and VR, and the provided project page with online demo supports reproducibility and immediate usability.

major comments (3)
  1. [Evaluation] The central claim of long-term geometric consistency without cumulative drift rests on Reconstituted Context Memory and Context Forcing, yet the evaluation provides no quantitative long-horizon metrics such as camera-pose drift, point-cloud alignment, or reprojection error tracked across thousands of frames; comparisons appear restricted to short clips or qualitative visuals, leaving the drift-prevention assumption unverified.
  2. [Method] The Reconstituted Context Memory description (dynamic rebuild + temporal reframing) does not specify implementation details such as memory buffer size, exact reframing procedure, or how geometric information is prioritized, making it impossible to assess whether these steps actually counteract diffusion-model error accumulation in streaming mode without introducing new artifacts.
  3. [Method] Context Forcing is presented as the key mechanism for preserving long-range capacity at real-time speeds, but no ablation studies isolate its contribution (e.g., performance with vs. without distillation) or report memory-aware alignment metrics, so its role in preventing error drift remains unquantified.
minor comments (2)
  1. [Abstract] The abstract states that the method 'compares favorably with existing techniques' but does not name the specific baselines or metrics used; adding these would improve clarity.
  2. [Figures] Figure captions and qualitative examples would benefit from explicit frame counts or sequence lengths to allow readers to judge the 'long-horizon' scope directly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the opportunity to address the concerns regarding evaluation metrics, implementation details, and ablation studies. We respond point-by-point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation] The central claim of long-term geometric consistency without cumulative drift rests on Reconstituted Context Memory and Context Forcing, yet the evaluation provides no quantitative long-horizon metrics such as camera-pose drift, point-cloud alignment, or reprojection error tracked across thousands of frames; comparisons appear restricted to short clips or qualitative visuals, leaving the drift-prevention assumption unverified.

    Authors: We agree that quantitative long-horizon metrics would provide stronger verification of the drift-prevention claims. While the manuscript includes qualitative results over extended sequences and short-term quantitative comparisons, we will add new experiments in the revised version reporting camera-pose drift, point-cloud alignment, and reprojection error over thousands of frames to directly address this gap. revision: yes

  2. Referee: [Method] The Reconstituted Context Memory description (dynamic rebuild + temporal reframing) does not specify implementation details such as memory buffer size, exact reframing procedure, or how geometric information is prioritized, making it impossible to assess whether these steps actually counteract diffusion-model error accumulation in streaming mode without introducing new artifacts.

    Authors: We thank the referee for highlighting this lack of detail. We will expand the method section in the revision to specify the memory buffer size, the exact temporal reframing procedure (including how frames are selected and reframed), and the prioritization mechanism for geometrically important long-past frames, enabling readers to evaluate its effectiveness against error accumulation. revision: yes

  3. Referee: [Method] Context Forcing is presented as the key mechanism for preserving long-range capacity at real-time speeds, but no ablation studies isolate its contribution (e.g., performance with vs. without distillation) or report memory-aware alignment metrics, so its role in preventing error drift remains unquantified.

    Authors: We acknowledge that dedicated ablations are needed to isolate Context Forcing's contribution. We will include additional ablation experiments in the revised manuscript comparing performance with and without the distillation step, along with memory-aware alignment metrics, to quantify its role in maintaining long-range information and preventing drift. revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on proposed empirical innovations

full rationale

The paper presents WorldPlay via three explicitly novel components (Dual Action Representation, Reconstituted Context Memory with temporal reframing, and Context Forcing distillation) whose descriptions do not reduce to self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. No equations appear in the abstract or summary, and the long-horizon consistency claim is framed as an outcome of these methods rather than a tautological restatement of inputs. The derivation chain is therefore self-contained as a proposal of new techniques evaluated against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all technical details remain at the level of named components without equations or fitting procedures.

pith-pipeline@v0.9.0 · 5535 in / 1030 out tokens · 45741 ms · 2026-05-15T14:25:17.859701+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/DimensionForcing · alexander_duality_circle_linking · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  2. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  3. WorldMark: A Unified Benchmark Suite for Interactive Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.

  4. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  5. Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

    cs.CV 2026-05 unverdicted novelty 6.0

    Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.

  6. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  7. ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...

  8. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  9. From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

    cs.CV 2026-04 unverdicted novelty 6.0

    Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.

  10. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  11. UNICA: A Unified Neural Framework for Controllable 3D Avatars

    cs.CV 2026-04 unverdicted novelty 6.0

    UNICA unifies motion planning, rigging, physical simulation, and rendering into a single skeleton-free neural framework that produces next-frame 3D avatar geometry from action inputs and renders it with Gaussian splatting.

  12. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  13. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  14. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  15. InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

    cs.CV 2026-03 unverdicted novelty 5.0

    InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretra...

  16. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    cs.CV 2026-04 unverdicted novelty 4.0

    HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...

  17. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

  18. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  19. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 3.0

    This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.

  20. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 19 Pith papers · 13 internal anchors
