arxiv: 2505.21996 · v3 · submitted 2025-05-28 · 💻 cs.CV · cs.AI

Learning World Models for Interactive Video Generation

Pith reviewed 2026-05-19 12:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video generation · world models · interactive video · video retrieval augmented generation · compounding errors · spatiotemporal consistency · autoregressive generation · global state conditioning

The pith

Video retrieval augmented generation with explicit global state conditioning reduces compounding errors and improves consistency in interactive video world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to build foundational world models that support interactive video generation while maintaining long-term spatiotemporal coherence. Current autoregressive approaches suffer from irreducible compounding errors and weak memory, leading to incoherent future predictions. By retrieving relevant past video clips and conditioning generation on an explicit global state, the proposed VRAG method mitigates these issues more effectively than simply extending context or using basic retrieval. This matters because better world models would enable more reliable planning and action selection in dynamic environments.

Core claim

Foundational world models for interactive video must address compounding errors, which are inherently irreducible in autoregressive setups, and insufficient memory mechanisms that cause incoherence. Enhancing image-to-video models with action conditioning and autoregressive generation reveals these limits, while video retrieval augmented generation (VRAG) paired with explicit global state conditioning significantly reduces long-term errors and boosts spatiotemporal consistency.
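
As a toy illustration of why autoregressive rollouts accumulate error (this is not the paper's own analysis), suppose each generated frame inherits the error of the frame it was conditioned on, scaled by a sensitivity factor, and then adds a fresh per-step error. The recurrence and the names `rollout_error`, `step_error`, and `gain` are assumptions made for this sketch only.

```python
from typing import List


def rollout_error(num_steps: int, step_error: float, gain: float = 1.0) -> List[float]:
    """Worst-case error after each step under e[t+1] = gain * e[t] + step_error.

    Hypothetical toy model: `gain` is how strongly an error in the conditioning
    frame carries into the next prediction, and `step_error` is the new error
    added at every step. Neither quantity is taken from the paper.
    """
    errors = []
    error = 0.0  # the first conditioning frame is assumed to be ground truth
    for _ in range(num_steps):
        error = gain * error + step_error
        errors.append(error)
    return errors


if __name__ == "__main__":
    # gain == 1: error grows linearly with the number of generated frames.
    print(rollout_error(5, step_error=0.1, gain=1.0))
    # gain > 1: error grows geometrically.
    print(rollout_error(5, step_error=0.1, gain=1.5))
```

In this toy model the error never returns to zero: even with `gain` below 1 it settles near `step_error / (1 - gain)`, which is one informal way to read the "irreducible" framing. Whether real video models behave like this recurrence is exactly what the paper's experiments would need to show.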

What carries the argument

Video retrieval augmented generation (VRAG) with explicit global state conditioning, which augments the generation process by retrieving past clips and maintaining a global state to preserve coherence over time.
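
A minimal sketch of the retrieval step, under stated assumptions: the global state is taken to be a pose vector, past clips are stored together with the pose at which they were generated, and the nearest stored poses are retrieved by Euclidean distance. The `ClipMemory` class, its method names, and the distance rule are all hypothetical; the summary above does not say how VRAG scores or selects clips.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class ClipMemory:
    """Hypothetical buffer of past clips keyed by global state (a pose vector)."""

    entries: List[Tuple[Tuple[float, ...], str]] = field(default_factory=list)

    def add(self, pose: Sequence[float], clip_id: str) -> None:
        # Store the pose at which the clip was generated, plus a handle to the clip.
        self.entries.append((tuple(pose), clip_id))

    def retrieve(self, query_pose: Sequence[float], k: int = 2) -> List[str]:
        # Rank stored clips by Euclidean distance between poses (an assumed rule).
        ranked = sorted(self.entries, key=lambda e: math.dist(e[0], query_pose))
        return [clip_id for _, clip_id in ranked[:k]]


if __name__ == "__main__":
    memory = ClipMemory()
    memory.add((0.0, 0.0, 0.0), "clip_000")   # (x, z, yaw) is an assumed layout
    memory.add((5.0, 0.0, 90.0), "clip_001")
    memory.add((0.5, 0.0, 10.0), "clip_002")
    # Clips returned here would be fed to the generator as extra conditioning.
    print(memory.retrieve((0.0, 0.0, 5.0), k=2))
```

Mixing position and angle in one Euclidean distance is a simplification; a real system would likely weight or normalize the components, and might retrieve on learned features instead of raw pose.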

If this is right

  • Interactive video generation becomes feasible for longer sequences without rapid loss of consistency.
  • World models can better support future planning with action choices in simulated environments.
  • Current limitations in video models' in-context learning are bypassed by explicit retrieval rather than relying on context windows alone.
  • Naive extensions like longer contexts or basic retrieval prove less effective, highlighting the need for structured augmentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar retrieval and state mechanisms could improve other autoregressive generative models in domains like text or audio.
  • Implementing VRAG might allow incremental improvements to existing video models without complete retraining from scratch.
  • This approach could be tested in real-world robotics or game environments to measure planning accuracy gains.

Load-bearing premise

That the main problems in video world models stem from insufficient memory and that retrieving past clips with global state can fix incoherence without creating new inconsistencies or needing full model retraining.

What would settle it

A direct comparison experiment showing whether videos generated with VRAG maintain object positions and scene coherence over many more frames than standard autoregressive methods, or if errors still accumulate similarly.
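
One hedged way to run such a comparison is to score every generated frame against ground truth and look at how the score decays with frame index. The sketch below uses PSNR on frames represented as flat lists of pixel intensities in [0, 1]. The function names and the flat-list representation are assumptions for illustration; the figure captions on this page indicate the paper also reports SSIM and LPIPS, which are not implemented here.

```python
import math
from typing import List, Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(pred) != len(target) or not pred:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_over_time(pred_video: List[Sequence[float]],
                   target_video: List[Sequence[float]]) -> List[float]:
    """Per-frame PSNR, so drift can be read off as a curve over frame index."""
    return [psnr(p, t) for p, t in zip(pred_video, target_video)]


if __name__ == "__main__":
    target = [[0.5, 0.5]] * 3
    drifting = [[0.5, 0.5], [0.6, 0.6], [0.7, 0.7]]  # error grows frame by frame
    print(psnr_over_time(drifting, target))
```

Computing the same curve for each method on identical action sequences, then comparing how quickly each one falls, would give the kind of evidence the question above asks for.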

Figures

Figures reproduced from arXiv: 2505.21996 by Chi Jin, Taiye Chen, Xun Hu, Zihan Ding.

Figure 1: A world model possesses memory capabilities and enables faithful long-term future prediction by maintaining awareness of its environment and generating predictions based on the current state and actions. Example is in Minecraft game.
Foundational world models capable of simulating future outcomes based on different actions are crucial for effective planning and decision-making [1, 2, 3]. To achieve this…
Figure 2: Overview of our VRAG framework for interactive video generation. The framework …
Figure 3: Visual comparison of VRAG with ground truth videos on world coherence evaluation. With …
Figure 4: Visual comparison of different methods, evaluated for world …
Figure 5: SSIM scores over time for different meth…
Figure 6: Visual comparison of long-term video prediction (1200 …
Figure 7: SSIM scores over time for compounding error evaluation
  Method            SSIM ↑
  DF (window 10)    0.297
  DF (window 20)    0.321
  YaRN              0.316
  History Buffer    0.188
  Neural Memory     0.283
  VRAG              0.349
Figure 8: Visualized video frames on RealEstate10K dataset.
Figure 9: Comparison of SSIM scores over time for VRAG variants.
  Method                SSIM ↑    PSNR ↑    LPIPS ↓
  VRAG                  0.506     17.097    0.506
  VRAG (no training)    0.455     16.670    0.528
  VRAG (no memory)      0.436     16.372    0.547
Figure 10: Comparison of SSIM, PSNR, LPIPS, and discriminator metrics. All metrics are normalized …
Figure 11: Comparison of vanilla long-context extension for DF model and YaRN with window …
Figure 12: Comparison of vanilla long-context extension for DF model and YaRN with window …
Figure 13: Comparison of vanilla long-context extension for DF model and YaRN with window …
Figure 14: Visual comparison of vanilla long-context extension for DF model and YaRN. Both …
Figure 15: Training Loss Curves
C.4 Predicted Global State
In the paper, our main experiments are conducted with the access to the ground-truth global state as conditions during training and inference. However, the practical usage may require the global state to be also predicted based on historical states and actions. To ablate this effect, we trained a pose (global state) prediction model that takes the current fr…
Figure 16: World coherence evaluation on all methods for PSNR (left) and LPIPS (right).
Figure 17: Compounding error evaluation on all methods for PSNR (left) and LPIPS (right).
Figure 18: Ablation study of VRAG components for world coherence (left) and compounding error …
Figure 19: Ablation study of VRAG components for world coherence (left) and compounding error …
Figure 20: Ablation study of VRAG components for world coherence (left) and compounding error …
Original abstract

Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies compounding errors and insufficient memory as core limitations in autoregressive video generation for world models. It augments image-to-video models with action conditioning, asserts that compounding error is inherently irreducible under autoregressive generation, and proposes video retrieval augmented generation (VRAG) with explicit global state conditioning to reduce long-term errors and improve spatiotemporal consistency. It further claims that naive extended-context autoregressive generation and standard retrieval-augmented generation are less effective due to limited in-context learning in current video models, while positioning the work as establishing a benchmark for internal world modeling capabilities.

Significance. If the claimed reductions in compounding error and gains in consistency are demonstrated, the introduction of VRAG with global state conditioning would address a practically important bottleneck in long-horizon interactive video generation, offering a concrete direction for memory-augmented world models beyond simple context extension.

major comments (2)
  1. [Abstract] The claim that 'compounding error is inherently irreducible in autoregressive video generation' is presented as a foundational revelation motivating VRAG, yet the manuscript supplies no formal argument, mathematical characterization, or empirical measurement of this irreducibility.
  2. [Abstract] The assertion that VRAG 'significantly reduces long-term compounding errors and increases spatiotemporal consistency' is the central empirical claim, but the text contains no experimental protocol, quantitative metrics, baselines, or results that would allow verification of these improvements.
minor comments (1)
  1. [Abstract] The phrase 'establishes a comprehensive benchmark' is used without any description of the benchmark's tasks, metrics, or evaluation protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the major points below and will revise the manuscript to better support the claims presented in the abstract.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'compounding error is inherently irreducible in autoregressive video generation' is presented as a foundational revelation motivating VRAG, yet the manuscript supplies no formal argument, mathematical characterization, or empirical measurement of this irreducibility.

    Authors: We acknowledge that the abstract presents this claim concisely without a formal argument, mathematical characterization, or empirical measurement. The abstract is a high-level summary. We will revise the manuscript to include a dedicated discussion with a simple mathematical model of error propagation in autoregressive frame prediction and empirical measurements from long-horizon experiments showing persistent compounding even under extended context. revision: yes

  2. Referee: [Abstract] The assertion that VRAG 'significantly reduces long-term compounding errors and increases spatiotemporal consistency' is the central empirical claim, but the text contains no experimental protocol, quantitative metrics, baselines, or results that would allow verification of these improvements.

    Authors: We agree that the abstract states the empirical claim without including the experimental protocol, quantitative metrics, baselines, or results. These elements appear in the experimental sections of the full manuscript. To address the concern, we will revise the abstract to briefly note the evaluation metrics (such as spatiotemporal consistency scores) and the main baselines (naive autoregressive and standard RAG) so that the improvements can be more readily understood and verified. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in available text

full rationale

The provided abstract states observations on limitations of current video generation models (compounding errors and insufficient memory) and proposes VRAG with explicit global state conditioning as an enhancement. No equations, detailed derivation steps, fitted parameters, or self-citations appear in the text. Claims such as the inherent irreducibility of compounding errors in autoregressive setups are presented as revelations without any shown reduction to inputs by construction, self-definitional loops, or renaming of known results. The central proposal remains a high-level method suggestion rather than a closed loop equivalent to its own premises, making the argument self-contained at the level of the abstract.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that memory insufficiency is the dominant source of long-term incoherence and that retrieval plus global conditioning can mitigate it without new failure modes. No free parameters or invented physical entities are mentioned.

axioms (2)
  • [domain assumption] Compounding error is inherently irreducible in autoregressive video generation
    Stated directly in the abstract as a revealed fact.
  • [domain assumption] Current video models have limited in-context learning capabilities
    Used to explain why extended context windows and naive retrieval are insufficient.
invented entities (1)
  • VRAG (video retrieval augmented generation) [no independent evidence]
    purpose: Explicit global state conditioning to reduce compounding errors in long video generation
    New method name and mechanism introduced in the abstract without external validation details.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 17 internal anchors

  1. [1] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. Advances in neural information processing systems, 28, 2015

  2. [2] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. Advances in neural information processing systems, 31, 2018

  3. [3] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  4. [4] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265, 2015

  5. [5] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019

  6. [6] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  7. [7] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  8. [8] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

  9. [9] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

  10. [10] Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

  11. [11] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. arXiv preprint arXiv:1906.02634, 2019

  12. [12] William Harvey, Søren Nørskov, Niklas Kölch, and George Vogiatzis. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022

  13. [13] Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation. arXiv preprint arXiv:2410.20502, 2024

  14. [14] Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. Progressive autoregressive video diffusion models. arXiv preprint arXiv:2410.08151, 2024

  15. [15] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  16. [16] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  17. [17] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2:1, 2023

  18. [18] Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024

  19. [19] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023

  20. [20] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7312–7322, 2023

  21. [21] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  22. [22] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7310–7320, 2024

  23. [23] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022

  24. [24] Uriel Singer, Adam Polyak, Eliya Nachmani, Guy Dahan, Eli Shechtman, and Haggai Hacohen. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022

  25. [25] Yu Hong, Jing Wei, Xing Liu, Xiaodi Wang, Yutong Bai, Haitao Li, Ming Zhang, and Hao Xu. Cogvideo: Large-scale pretraining for text-to-video generation with transformers. arXiv preprint arXiv:2205.15868, 2022

  26. [26] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  27. [27] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023

  28. [28] Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Krishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, and Yuchen Liu. Dollar: Few-step video generation via distillation and latent reward optimization. arXiv preprint arXiv:2412.15689, 2024

  29. [29] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  30. [30] Diederik P Kingma, Max Welling, et al. Auto-encoding variational bayes, 2013

  31. [31] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  32. [32] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  33. [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  34. [34] Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, et al. Slowfast-vgen: Slow-fast learning for action-driven long video generation. arXiv preprint arXiv:2410.23277, 2024

  35. [35] Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, and Mingsheng Long. ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems, 37:68082–68119, 2024

  36. [36] Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training. arXiv preprint arXiv:2405.11473, 2024

  37. [37] Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024

  38. [38] Sand-AI. Magi-1: Autoregressive video generation at scale, 2025

  39. [39] Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773, 2024

  40. [40] Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. Advances in neural information processing systems, 35:23371–23385, 2022

  41. [41] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  42. [42] Zihan Ding, Amy Zhang, Yuandong Tian, and Qinqing Zheng. Diffusion world model: Future modeling beyond step-by-step rollout for offline reinforcement learning. arXiv preprint arXiv:2402.03570, 2024

  43. [43] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  44. [44] Decart, Etched, Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer. 2024

  45. [45] Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. arXiv preprint arXiv:2411.00769, 2024

  46. [46] Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. Mineworld: a real-time and open-source interactive world model on minecraft. arXiv preprint arXiv:2504.08388, 2025

  47. [47]

    Gamefactory: Creating new games with generative interactive videos.arXiv preprint arXiv:2501.08325, 2025

    Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos.arXiv preprint arXiv:2501.08325, 2025

  48. [48]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

  49. [49]

    Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, et al. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025

  50. [50]

    Navigation world models.arXiv preprint arXiv:2412.03572, 2024

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models.arXiv preprint arXiv:2412.03572, 2024

  51. [51]

    Genie 2: A large-scale foundation world model

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna ...

  52. [52]

    Aether: Geometric-aware unified world modeling

    Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, et al. Aether: Geometric-aware unified world modeling. arXiv preprint arXiv:2503.18945, 2025

  53. [53]

    Gen3c: 3d-informed world-consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. arXiv preprint arXiv:2503.03751, 2025

  54. [54]

    Reconx: Reconstruct any scene from sparse views with video diffusion model

    Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model. arXiv preprint arXiv:2408.16767, 2024

  55. [55]

    Cat3d: Create anything in 3d with multi-view diffusion models

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314, 2024

  56. [56]

    Tesseract: Learning 4d embodied world models

    Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: Learning 4d embodied world models. arXiv preprint arXiv:2504.20995, 2025

  57. [57]

    Worldmem: Long-term consistent world simulation with memory

    Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369, 2025

  58. [58]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  59. [59]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023

  60. [60]

    Leave no context behind: Efficient infinite context transformers with infini-attention

    Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention, 2024

  61. [61]

    Packing input frame context in next-frame prediction models for video generation

    Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation, 2025

  62. [62]

    Minerl: A large-scale dataset of minecraft demonstrations

    William H Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov. Minerl: A large-scale dataset of minecraft demonstrations. arXiv preprint arXiv:1907.13440, 2019

  63. [63]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004

  64. [64]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  65. [65]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pag...

  66. [66]

    History-guided video diffusion

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion, 2025

  67. [67]

    Realestate10k

    Google. Realestate10k. https://google.github.io/realestate10k/index.html, 2018. Accessed: 2025-07-27

  68. [68]

    Loopy: Taming audio-driven portrait avatar with long-term motion dependency

    Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. Loopy: Taming audio-driven portrait avatar with long-term motion dependency. arXiv preprint arXiv:2409.02634, 2024
