pith. machine review for the scientific record.

arxiv: 2512.14614 · v1 · submitted 2025-12-16 · 💻 cs.CV · cs.GR

Recognition: 3 theorem links


WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 14:25 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords: video diffusion · world modeling · geometric consistency · streaming generation · context memory · real-time video · interactive simulation · action conditioning

The pith

WorldPlay generates long-horizon 720p video at 24 FPS while preserving geometric consistency through rebuilt context memory in a streaming diffusion model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WorldPlay to resolve the speed versus memory trade-off that has limited interactive world modeling. It combines a dual action representation for keyboard and mouse control with two techniques that keep distant past frames available: a memory module that reconstitutes context on the fly and a distillation step that forces alignment between a full-context teacher and a fast student. A sympathetic reader would care because this setup produces interactive scenes that stay coherent over many seconds instead of drifting into geometric nonsense. If the approach holds, it removes a major barrier to real-time simulation of 3D environments directly from video.

Core claim

WorldPlay is a streaming video diffusion model that produces real-time interactive world models with long-term geometric consistency. It rests on three components: a dual action representation that converts user keyboard and mouse inputs into robust control signals, a Reconstituted Context Memory that dynamically rebuilds past frames and applies temporal reframing to retain geometrically critical information, and Context Forcing, a distillation process that aligns memory usage between teacher and student so the faster model does not lose long-range awareness. Together these allow generation of 720p video at 24 frames per second across long horizons while reducing error accumulation.
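The abstract names these components without equations or pseudocode, so the sketch below is only a hedged illustration of how a reconstituted context memory with temporal reframing might work: keep a few recent frames, fill the remaining budget with the most geometrically important long-past frames, and remap their temporal indices into a compact range so they stay inside the model's context window. The class name, the capacity and recency parameters, and the geometry-importance score are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ReconstitutedContextMemory:
    """Illustrative sketch, not the paper's implementation: a fixed-size
    context mixing recent frames with geometrically important long-past
    frames, with temporal indices reframed for a streaming model."""
    capacity: int = 16        # assumed total context budget, in frames
    recent_window: int = 8    # assumed number of always-kept recent frames
    frames: list = field(default_factory=list)  # entries: (index, latent, score)

    def add(self, index, latent, geometry_score):
        # geometry_score is a stand-in for whatever criterion marks a frame
        # as geometrically important (e.g. viewpoint novelty).
        self.frames.append((index, latent, geometry_score))

    def rebuild(self, current_index):
        # Always keep the most recent frames.
        recent = [f for f in self.frames if current_index - f[0] < self.recent_window]
        older = [f for f in self.frames if current_index - f[0] >= self.recent_window]
        # Fill the remaining slots with the highest-scoring long-past frames.
        older.sort(key=lambda f: f[2], reverse=True)
        keep = recent + older[: max(0, self.capacity - len(recent))]
        keep.sort(key=lambda f: f[0])
        # "Temporal reframing" here means remapping original frame indices
        # onto a compact 0..len(keep)-1 axis so distant frames remain
        # addressable inside the context window.
        reframed_positions = {f[0]: i for i, f in enumerate(keep)}
        context = np.stack([f[1] for f in keep])
        return context, reframed_positions
```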

What carries the argument

Reconstituted Context Memory, which dynamically rebuilds and reframes past frames to keep long-range geometric details accessible, paired with Context Forcing distillation that preserves the student's ability to use that memory at real-time speed.
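Context Forcing is described only as aligning memory context between teacher and student. The following is a minimal sketch, under that reading, of one distillation step in which both models consume the same reconstituted memory so the student cannot learn to ignore long-range information. The plain output-matching MSE loss and the model call signatures are assumptions; the paper likely uses a distribution-matching style objective, which this stand-in does not reproduce.

```python
import torch
import torch.nn.functional as F

def context_forcing_step(teacher, student, noisy_latents, timestep, memory_context, optimizer):
    """Hedged sketch of a memory-aligned distillation step: teacher and
    student see the identical reconstituted context, so matching the
    teacher forces the student to keep using long-range memory."""
    with torch.no_grad():
        teacher_pred = teacher(noisy_latents, timestep, context=memory_context)
    student_pred = student(noisy_latents, timestep, context=memory_context)
    loss = F.mse_loss(student_pred, teacher_pred)  # stand-in objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```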

If this is right

  • The model produces 720p streaming video at 24 FPS with better geometric consistency than prior methods.
  • It maintains coherence across long interaction horizons without visible drift.
  • It generalizes to a wide range of scenes without retraining.
  • It achieves real-time inference while still using information from frames that would otherwise be forgotten.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the memory alignment technique works, similar distillation could be applied to other streaming generative models that currently suffer from state forgetting.
  • The same context-rebuilding pattern might allow interactive 3D reconstruction pipelines to operate without maintaining full voxel or mesh histories.
  • Success here would suggest that explicit temporal reframing can substitute for ever-larger context windows in video models.

Load-bearing premise

The memory-rebuilding and distillation steps actually retain accurate long-range geometry across hundreds of frames without introducing new distortions or needing prohibitive storage.

What would settle it

Side-by-side measurement of object positions and surface normals in generated 720p sequences versus ground-truth geometry after 500 or more frames; any systematic increase in positional error beyond a few pixels would falsify the consistency claim.
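As a concrete form of that test, the sketch below computes per-frame positional error from tracked point locations in a generated sequence against ground-truth projections. How the tracked points are obtained (for example, an off-the-shelf tracker or pose estimator run on the generated 720p frames) is an assumption, since the abstract specifies no evaluation protocol.

```python
import numpy as np

def positional_drift(gt_points_px, gen_points_px):
    """Hedged sketch of the falsification test described above.

    gt_points_px, gen_points_px: arrays of shape (num_frames, num_points, 2)
    holding pixel locations of the same tracked points in the ground-truth
    and generated sequences. Establishing these correspondences is assumed
    to be handled by an external tracker; the paper does not specify one.
    """
    errors = np.linalg.norm(gen_points_px - gt_points_px, axis=-1)  # (frames, points)
    per_frame = errors.mean(axis=1)
    # The consistency claim would be falsified by a systematic rise, e.g.
    # mean error past frame 500 settling above a few pixels.
    late_error = per_frame[500:].mean() if per_frame.shape[0] > 500 else float("nan")
    return per_frame, late_error
```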

read the original abstract

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware model. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found: https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This paper presents WorldPlay, a streaming video diffusion model for real-time interactive world modeling that achieves long-term geometric consistency. It introduces three innovations: a Dual Action Representation for handling user keyboard/mouse inputs, Reconstituted Context Memory that dynamically rebuilds context from past frames with temporal reframing to maintain access to geometrically important long-past frames, and Context Forcing, a distillation technique that aligns memory context between teacher and student models to preserve long-range information while enabling real-time inference. The method claims to generate long-horizon 720p video at 24 FPS with superior consistency to existing techniques and strong generalization across diverse scenes.

Significance. If the central claims are substantiated, this would represent a meaningful advance in interactive world modeling by resolving the speed-memory trade-off in streaming video diffusion models. The approach of combining dynamic memory reconstitution with teacher-student alignment for drift prevention could enable practical real-time applications in simulation and VR, and the provided project page with online demo supports reproducibility and immediate usability.

major comments (3)
  1. [Evaluation] The central claim of long-term geometric consistency without cumulative drift rests on Reconstituted Context Memory and Context Forcing, yet the evaluation provides no quantitative long-horizon metrics such as camera-pose drift, point-cloud alignment, or reprojection error tracked across thousands of frames; comparisons appear restricted to short clips or qualitative visuals, leaving the drift-prevention assumption unverified.
  2. [Method] The Reconstituted Context Memory description (dynamic rebuild + temporal reframing) does not specify implementation details such as memory buffer size, exact reframing procedure, or how geometric information is prioritized, making it impossible to assess whether these steps actually counteract diffusion-model error accumulation in streaming mode without introducing new artifacts.
  3. [Method] Context Forcing is presented as the key mechanism for preserving long-range capacity at real-time speeds, but no ablation studies isolate its contribution (e.g., performance with vs. without distillation) or report memory-aware alignment metrics, so its role in preventing error drift remains unquantified.
minor comments (2)
  1. [Abstract] The abstract states that the method 'compares favorably with existing techniques' but does not name the specific baselines or metrics used; adding these would improve clarity.
  2. [Figures] Figure captions and qualitative examples would benefit from explicit frame counts or sequence lengths to allow readers to judge the 'long-horizon' scope directly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the opportunity to address the concerns regarding evaluation metrics, implementation details, and ablation studies. We respond point-by-point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation] The central claim of long-term geometric consistency without cumulative drift rests on Reconstituted Context Memory and Context Forcing, yet the evaluation provides no quantitative long-horizon metrics such as camera-pose drift, point-cloud alignment, or reprojection error tracked across thousands of frames; comparisons appear restricted to short clips or qualitative visuals, leaving the drift-prevention assumption unverified.

    Authors: We agree that quantitative long-horizon metrics would provide stronger verification of the drift-prevention claims. While the manuscript includes qualitative results over extended sequences and short-term quantitative comparisons, we will add new experiments in the revised version reporting camera-pose drift, point-cloud alignment, and reprojection error over thousands of frames to directly address this gap. revision: yes

  2. Referee: [Method] The Reconstituted Context Memory description (dynamic rebuild + temporal reframing) does not specify implementation details such as memory buffer size, exact reframing procedure, or how geometric information is prioritized, making it impossible to assess whether these steps actually counteract diffusion-model error accumulation in streaming mode without introducing new artifacts.

    Authors: We thank the referee for highlighting this lack of detail. We will expand the method section in the revision to specify the memory buffer size, the exact temporal reframing procedure (including how frames are selected and reframed), and the prioritization mechanism for geometrically important long-past frames, enabling readers to evaluate its effectiveness against error accumulation. revision: yes

  3. Referee: [Method] Context Forcing is presented as the key mechanism for preserving long-range capacity at real-time speeds, but no ablation studies isolate its contribution (e.g., performance with vs. without distillation) or report memory-aware alignment metrics, so its role in preventing error drift remains unquantified.

    Authors: We acknowledge that dedicated ablations are needed to isolate Context Forcing's contribution. We will include additional ablation experiments in the revised manuscript comparing performance with and without the distillation step, along with memory-aware alignment metrics, to quantify its role in maintaining long-range information and preventing drift. revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on proposed empirical innovations

full rationale

The paper presents WorldPlay via three explicitly novel components (Dual Action Representation, Reconstituted Context Memory with temporal reframing, and Context Forcing distillation) whose descriptions do not reduce to self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. No equations appear in the abstract or summary, and the long-horizon consistency claim is framed as an outcome of these methods rather than a tautological restatement of inputs. The derivation chain is therefore self-contained as a proposal of new techniques evaluated against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all technical details remain at the level of named components without equations or fitting procedures.

pith-pipeline@v0.9.0 · 5535 in / 1030 out tokens · 45741 ms · 2026-05-15T14:25:17.859701+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/DimensionForcing · alexander_duality_circle_linking · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  2. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  3. WorldMark: A Unified Benchmark Suite for Interactive Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.

  4. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  5. Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

    cs.CV 2026-05 unverdicted novelty 6.0

    Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.

  6. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  7. ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...

  8. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  9. From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

    cs.CV 2026-04 unverdicted novelty 6.0

    Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.

  10. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  11. UNICA: A Unified Neural Framework for Controllable 3D Avatars

    cs.CV 2026-04 unverdicted novelty 6.0

    UNICA unifies motion planning, rigging, physical simulation, and rendering into a single skeleton-free neural framework that produces next-frame 3D avatar geometry from action inputs and renders it with Gaussian splatting.

  12. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  13. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  14. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  15. InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

    cs.CV 2026-03 unverdicted novelty 5.0

    InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretra...

  16. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    cs.CV 2026-04 unverdicted novelty 4.0

    HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...

  17. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

  18. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  19. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 3.0

    This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.

  20. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 19 Pith papers · 13 internal anchors
