pith. machine review for the scientific record.

arxiv: 2604.08995 · v2 · submitted 2026-04-10 · 💻 cs.CV

Recognition: 2 Lean theorem links

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords interactive video generation · world models · diffusion models · real-time generation · long-horizon consistency · memory augmentation · video generation

The pith

Matrix-Game 3.0 generates 720p interactive video in real time at 40 frames per second while holding memory consistency over minute-long sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Matrix-Game 3.0 as a memory-augmented interactive world model that extends prior versions through three main upgrades: an industrial-scale data engine producing Video-Pose-Action-Prompt quadruplets from synthetic, game, and real-world sources; a training approach that models prediction residuals and re-injects imperfect generated frames so the model learns self-correction, paired with camera-aware memory retrieval and injection for long-horizon spatiotemporal consistency; and a multi-segment autoregressive distillation pipeline using Distribution Matching Distillation together with quantization and VAE pruning for efficient inference. These elements together enable a 5B model to reach 40 FPS at 720p while keeping stable memory over extended sequences, overcoming the prior trade-off between real-time speed, resolution, and long-term consistency in diffusion-based world models. Scaling the approach to a 2x14B model further boosts quality and generalization.
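To make the self-correction idea concrete, the sketch below shows one way imperfect-frame re-injection could be wired into a training loop: with some probability the context for the next step is the model's own detached prediction rather than the clean latent, and the model is parameterized to predict a residual on top of that context. The `TinyWorldModel` module, tensor shapes, and `reinject_prob` value are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of self-correction training via re-injection of the
# model's own imperfect predictions; modules, shapes, and probabilities are
# assumptions for illustration, not the paper's actual code.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Stand-in for the diffusion/DiT world model: predicts a residual for the
    next latent frame from the previous latent frame and an action embedding."""
    def __init__(self, latent_dim=64, action_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.GELU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, prev_latent, action):
        return self.net(torch.cat([prev_latent, action], dim=-1))

def training_step(model, optimizer, latents, actions, reinject_prob=0.5):
    """latents: (B, T, D) ground-truth latent frames; actions: (B, T, A).
    With probability `reinject_prob`, the context for step t is the model's own
    detached prediction instead of the clean latent, so the model must learn to
    correct its own errors."""
    B, T, D = latents.shape
    context = latents[:, 0]                 # start from a clean frame
    loss = latents.new_zeros(())
    for t in range(1, T):
        pred = context + model(context, actions[:, t - 1])   # residual update
        loss = loss + torch.mean((pred - latents[:, t]) ** 2)
        # Sometimes re-inject the imperfect prediction as the next context.
        use_pred = torch.rand(B, 1, device=latents.device) < reinject_prob
        context = torch.where(use_pred, pred.detach(), latents[:, t])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```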

Core claim

By combining an upgraded infinite data engine, residual modeling with re-injection of imperfect frames for self-correction, camera-aware memory retrieval and injection, and DMD-based multi-segment autoregressive distillation with quantization and pruning, Matrix-Game 3.0 achieves up to 40 FPS real-time 720p generation using a 5B model and maintains stable memory consistency over minute-long sequences.
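The distillation half of the claim is structural: the few-step student rolls out several segments autoregressively, the way it would at inference time, and only the final segment is scored by a distribution-matching loss. Below is a minimal sketch of that rollout pattern under assumed interfaces (`student.generate`, `dmd_loss`); it is not the paper's DMD implementation.

```python
# Hypothetical multi-segment autoregressive rollout for distillation: the
# student generates several consecutive segments as it would at inference
# time, and only the last segment is passed to a distribution-matching loss
# (a stand-in for DMD). All interfaces and segment counts are assumptions.
def multi_segment_rollout(student, dmd_loss, init_context, actions,
                          num_segments=3, frames_per_segment=8):
    """actions: list of per-segment action tensors. Earlier segments only
    provide context and are detached, so gradients flow through the final
    segment alone, mirroring the training-inference-aligned scheme sketched
    in the paper's Figure 6."""
    context = init_context
    segment = None
    for s in range(num_segments):
        segment = student.generate(context, actions[s],
                                   num_frames=frames_per_segment)
        if s < num_segments - 1:
            context = segment.detach()      # no gradient through earlier segments
    # Distribution matching is applied to the final segment only.
    return dmd_loss(segment, context, actions[-1])
```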

What carries the argument

Camera-aware memory retrieval and injection combined with residual prediction modeling that re-injects generated frames during training to enable self-correction.
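A hedged sketch of what camera-aware retrieval could look like: keep a buffer of past latent frames keyed by camera pose, score stored poses against the current viewpoint, and return the closest few latents for injection as extra conditioning. The pose distance, weighting, and top-k choice are assumptions, not details taken from the paper.

```python
# Hypothetical camera-aware memory retrieval: score stored frames by how close
# their camera pose is to the query pose, then return the top-k latents as
# extra conditioning. Distance weights and k are illustrative assumptions.
import torch

class CameraMemory:
    def __init__(self, rot_weight=1.0, trans_weight=1.0):
        self.poses = []      # each pose: dict with 'R' (3x3) and 't' (3,)
        self.latents = []    # latent frame stored alongside each pose
        self.rot_weight = rot_weight
        self.trans_weight = trans_weight

    def add(self, pose, latent):
        self.poses.append(pose)
        self.latents.append(latent)

    def retrieve(self, query_pose, k=4):
        if not self.poses:
            return None
        dists = []
        for pose in self.poses:
            # Translation distance plus a simple rotation discrepancy
            # (Frobenius norm of R_q^T R - I); both choices are assumptions.
            trans_d = torch.linalg.vector_norm(pose["t"] - query_pose["t"])
            rel = query_pose["R"].T @ pose["R"]
            rot_d = torch.linalg.matrix_norm(rel - torch.eye(3))
            dists.append(self.trans_weight * trans_d + self.rot_weight * rot_d)
        dists = torch.stack(dists)
        idx = torch.topk(dists, min(k, len(self.latents)), largest=False).indices
        # The retrieved latents would be injected as extra tokens/conditions in
        # the DiT's attention; here they are simply returned.
        return torch.stack([self.latents[i] for i in idx.tolist()])
```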

If this is right

  • Interactive applications can sustain long-form video generation at real-time speeds without resets or loss of consistency.
  • Larger models trained with the same residual and memory methods show improved dynamics and generalization.
  • The approach supplies a direct route to deployable industrial-scale world models for simulation and gaming.
  • Real-time high-resolution output becomes practical for streaming interactive scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The residual re-injection method may transfer to other video diffusion models to lengthen their reliable generation horizon without extra supervision.
  • Camera-aware memory retrieval indicates that explicit viewpoint conditioning is key to preventing drift in 3D-consistent world models.
  • If self-correction from noisy self-generated data works reliably, training loops could increasingly rely on the model's own outputs rather than only clean ground truth.

Load-bearing premise

Re-injecting imperfect generated frames during training plus camera-aware memory retrieval will produce long-horizon spatiotemporal consistency without visible drift or compounding errors once the model leaves the training distribution.

What would settle it

Generating minute-long interactive sequences on novel out-of-distribution actions or environments and measuring whether object positions, camera trajectories, and visual details remain consistent without accumulated artifacts or drift.
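One concrete form such a test could take, sketched under assumptions: pair each revisited viewpoint with the frame generated at the first visit and track reconstruction error against the revisit gap. The PSNR metric and the frame pairing are illustrative choices, not the paper's evaluation protocol.

```python
# Illustrative drift measurement: when the camera revisits a previously seen
# viewpoint, compare the newly generated frame against the frame generated at
# the first visit and track how the error grows with elapsed time. The PSNR
# metric and the pairing of "revisit" frames are assumptions.
import numpy as np

def psnr(a, b, max_val=1.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def drift_curve(reference_frames, revisit_frames, gaps_seconds):
    """reference_frames[i] and revisit_frames[i] show the same viewpoint,
    generated at the first visit and at a revisit gaps_seconds[i] later.
    A curve that decays steeply with the gap suggests accumulated drift;
    a flat curve supports the minute-scale consistency claim."""
    return sorted(
        (gap, psnr(ref, rev))
        for gap, ref, rev in zip(gaps_seconds, reference_frames, revisit_frames)
    )
```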

Figures

Figures reproduced from arXiv: 2604.08995 by Baixin Xu, Biao Jiang, Boyi Jiang, Fei Kang, Haofeng Sun, Hua Xue, Jiangbo Pei, Jiaxing Li, Kaichen Huang, Liang Hu, Mengyin An, Peiyu Wang, Wanli Ouyang, Wei Li, Xianglong He, Yahui Zhou, Yangguang Li, Yang Liu, Yichen Wei, Yidan Xietian, Zexiang Liu, Zidong Wang, Zile Wang.

Figure 1. Matrix-Game 3.0 introduces precise action control and long-horizon memory retrieval, enabling an interactive world model with long-term memory and real-time performance of up to 40 FPS.
Figure 2. Overview of Matrix-Game 3.0. Our framework unifies Unreal Engine-based data generation, memory-augmented DiT training with an error buffer, and accelerated real-time deployment. It generates long-horizon training videos with paired action and camera-pose supervision, learns action-conditioned generation with memory-enhanced consistency, and supports real-time inference through few-step sampling and quantization.
Figure 3. Illustration of our interactive base model. We jointly perform error-aware modeling over the past and current latent frames, while explicitly injecting action conditions into the model. This design enables autoregressive, long-horizon interactive generation and maintains consistency with the subsequent distillation stage.
Figure 4. Illustration of our memory-augmented base model. Built upon the bidirectional base model, we incorporate retrieved memory frames as additional conditions and introduce small memory perturbations to enhance robustness. This design enables the base model to jointly model long-term memory, short-term history, and the current prediction target under the same attention mode as the base model.
Figure 5. Frame-level self-attention visualization for the memory-enhanced DiT.
Figure 6. Illustration of our few-step distillation stage. The bidirectional student performs multi-segment rollouts to mimic actual few-step inference, with the final segment used for distribution matching, thereby ensuring training-inference consistency.
Figure 7. Representative scenes and agent trajectories from our data engine.
Figure 8. Qualitative results of our interactive base model.
Figure 9. Memory-based scene revisitation in long videos. Each row is sampled uniformly in time.
Figure 10. Qualitative results of our 28B model on third-person video generation.
Figure 11. Qualitative results of our distilled model. Each row is sampled uniformly over time.
Figure 12. For each case, the top row shows the original video and the bottom row shows the …
read the original abstract

With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Matrix-Game 3.0, a memory-augmented interactive world model extending Matrix-Game 2.0 for 720p real-time long-form video generation. It introduces an industrial-scale data engine generating Video-Pose-Action-Prompt quadruplets from synthetic Unreal Engine data, AAA game collection, and real-world augmentation; a training framework using residual prediction with imperfect-frame re-injection plus camera-aware memory retrieval/injection for long-horizon spatiotemporal consistency; and multi-segment DMD distillation combined with quantization and VAE pruning for efficient inference. The central claim is that the 5B model achieves up to 40 FPS real-time generation while maintaining stable memory consistency over minute-long sequences, with further gains from scaling to a 2x14B model.

Significance. If independently verified, the combination of real-time high-resolution inference with demonstrated minute-scale consistency would constitute a practical advance for deployable world models in interactive applications. The explicit engineering of self-correction via residual modeling and camera-aware retrieval, together with the large-scale quadruplet data pipeline, supplies a concrete recipe that could be adopted or extended by others working on streaming video generation.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experimental results): The headline performance figures (40 FPS at 720p with the 5B model and stable consistency over minute-long sequences) are reported without any quantitative baseline comparisons, error bars, ablation tables, or failure-case analysis. This absence makes it impossible to isolate the contribution of the residual-prediction objective, the camera-aware memory module, or the DMD distillation from the overall pipeline.
  2. [§3.2] §3.2 (Training framework for long-horizon consistency): The claim that re-injecting imperfect generated frames produces a robust self-correction attractor rests on the assumption that the residual objective generalizes outside the synthetic/game quadruplet distribution. No experiments or analysis are provided that test for compounding spatiotemporal drift when camera poses or scene dynamics deviate from the training data, which directly bears on the minute-scale consistency claim.
  3. [§3.1 and §4] §3.1 (Data engine) and §4: The Video-Pose-Action-Prompt quadruplet engine is described as the foundation for both training and evaluation, yet no quantitative metrics (e.g., diversity statistics, pose-estimation accuracy, or distribution-shift measures) are supplied to show how it differs from prior game or synthetic datasets, nor are any cross-dataset generalization results reported.
minor comments (2)
  1. [Abstract] The notation “2x14B model” is ambiguous; clarify whether this denotes an ensemble, a mixture-of-experts architecture, or simply two independent 14B models.
  2. [Figures] Figure captions and the inference pipeline diagram should explicitly label the memory retrieval/injection points and the DMD segment boundaries to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. The comments identify key areas where additional evidence would strengthen the manuscript. We address each major comment below and will incorporate the suggested revisions in the next version.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental results): The headline performance figures (40 FPS at 720p with the 5B model and stable consistency over minute-long sequences) are reported without any quantitative baseline comparisons, error bars, ablation tables, or failure-case analysis. This absence makes it impossible to isolate the contribution of the residual-prediction objective, the camera-aware memory module, or the DMD distillation from the overall pipeline.

    Authors: We agree that the absence of direct quantitative baselines and component ablations limits the ability to isolate contributions. In the revised manuscript we will add a dedicated comparison table against Matrix-Game 2.0 and other published real-time video generation methods, reporting both speed and long-horizon consistency metrics. We will also include ablation tables that separately disable residual self-correction, camera-aware memory retrieval, and the multi-segment DMD stage, together with error bars computed over multiple evaluation seeds. A short failure-case analysis with representative drift examples will be added to the experimental section. revision: yes

  2. Referee: [§3.2] §3.2 (Training framework for long-horizon consistency): The claim that re-injecting imperfect generated frames produces a robust self-correction attractor rests on the assumption that the residual objective generalizes outside the synthetic/game quadruplet distribution. No experiments or analysis are provided that test for compounding spatiotemporal drift when camera poses or scene dynamics deviate from the training data, which directly bears on the minute-scale consistency claim.

    Authors: The residual objective is trained on imperfect frames produced by the model itself within the quadruplet distribution, which already contains substantial variation in pose and dynamics. Nevertheless, we acknowledge the lack of explicit out-of-distribution tests. In the revision we will add controlled experiments that perturb camera trajectories and introduce scene elements outside the training distribution, then measure spatiotemporal drift over minute-scale rollouts. These results will be reported alongside the existing consistency metrics to directly address generalization of the self-correction mechanism. revision: yes

  3. Referee: [§3.1 and §4] §3.1 (Data engine) and §4: The Video-Pose-Action-Prompt quadruplet engine is described as the foundation for both training and evaluation, yet no quantitative metrics (e.g., diversity statistics, pose-estimation accuracy, or distribution-shift measures) are supplied to show how it differs from prior game or synthetic datasets, nor are any cross-dataset generalization results reported.

    Authors: We agree that quantitative characterization of the data pipeline would help readers assess its novelty and coverage. In the revised manuscript we will report diversity statistics (scene category coverage, action entropy, camera trajectory variance), pose-estimation accuracy on a held-out validation set, and distribution-shift metrics (e.g., Fréchet video distance) relative to prior game and synthetic datasets. We will also include cross-dataset generalization results by evaluating the trained model on external real-world video sequences without additional fine-tuning. revision: yes
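For readers weighing this response, the sketch below shows the kind of diversity statistics it proposes, under assumed definitions: Shannon entropy over discrete action labels and per-clip variance of camera centers. Neither formula comes from the paper; both are illustrative.

```python
# Minimal sketch of two of the proposed dataset diversity statistics: action
# entropy over discrete action labels and camera trajectory variance. The exact
# definitions the authors would report are not given; these are assumptions.
import numpy as np
from collections import Counter

def action_entropy(action_labels):
    """Shannon entropy (in bits) of the empirical action distribution."""
    counts = np.array(list(Counter(action_labels).values()), dtype=np.float64)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def trajectory_variance(camera_positions):
    """camera_positions: (N, 3) array of camera centers along one clip.
    Total variance of the positions around their mean, one scalar per clip."""
    pos = np.asarray(camera_positions, dtype=np.float64)
    return float(pos.var(axis=0).sum())
```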

Circularity Check

0 steps flagged

No significant circularity in claimed results or methods

full rationale

The manuscript describes an empirical engineering pipeline (data engine, residual modeling with imperfect-frame re-injection, camera-aware retrieval, and DMD-based distillation) and reports measured outcomes (40 FPS at 720p, minute-scale consistency) from running that pipeline on its own data and models. No mathematical derivation, equation, or theorem is presented that reduces by construction to its own inputs; no load-bearing self-citations or uniqueness theorems are invoked; performance figures are direct experimental measurements rather than independent predictions. The work is therefore self-contained as a system report.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claims rest on the assumption that the described residual self-correction and camera-aware memory mechanisms generalize beyond the authors' curated quadruplet data; several model sizes and distillation hyperparameters are introduced without external grounding.

free parameters (2)
  • 5B and 2x14B model sizes
    Chosen to balance speed and quality; no derivation given for these exact capacities.
  • DMD distillation segments and quantization bits
    Tuned to reach 40 FPS; values are not derived from first principles.
axioms (2)
  • domain assumption Re-injecting imperfect frames during training teaches reliable self-correction
    Invoked in the training framework section of the abstract without proof or external validation.
  • domain assumption Camera-aware memory retrieval preserves spatiotemporal consistency over minutes
    Central to the long-horizon claim; treated as effective once implemented.
invented entities (2)
  • Video-Pose-Action-Prompt quadruplet data engine no independent evidence
    purpose: Industrial-scale training data source combining synthetic and real video
    New data pipeline introduced to support the memory-augmented model
  • camera-aware memory retrieval and injection module no independent evidence
    purpose: Enable long-horizon consistency by storing and retrieving past camera states
    Core architectural addition not present in standard diffusion video models

pith-pipeline@v0.9.0 · 5684 in / 1625 out tokens · 42327 ms · 2026-05-10T18:20:41.173890+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  2. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  3. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 3.0

    This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.

Reference graph

Works this paper leans on

60 extracted references · 37 canonical work pages · cited by 2 Pith papers · 12 internal anchors

  1. [1]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025. 17

  2. [2]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...

  3. [3]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InInternational Conference on Machine Learning, 2024

  4. [4]

Mixture of contexts for long video generation

    Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan L. Yuille, Leonidas J. Guibas, Maneesh Agrawala, Lu Jiang, and Gordon Wetzstein. Mixture of contexts for long video generation. InInternational Conference on Learning Representations (ICLR), 2026

  5. [5]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  6. [6]

Deepverse: 4D autoregressive video generation as a world model

    Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. Deepverse: 4d autoregressive video generation as a world model. arXiv preprint arXiv:2506.01103, 2025

  7. [7]

    Lightx2v: Light video generation inference framework, 2025

    LightX2V Contributors. Lightx2v: Light video generation inference framework, 2025. GitHub repository

  8. [8]

Self-Forcing++: Towards minute-scale high-quality video generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025

  9. [9]

    Lol: Longer than longer, scaling video generation to hour, 2026

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Lol: Longer than longer, scaling video generation to hour, 2026

  10. [10]

    Oasis: A universe in a transformer

    Decart. Oasis: A universe in a transformer. 2024

  11. [11]

    The matrix: Infinite-horizon world generation with real-time moving control

    Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024

  12. [12]

Dreamdojo: A generalist robot world model from large-scale human videos

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

  13. [13]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026

  14. [14]

    Training Agents Inside of Scalable World Models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models.arXiv preprint arXiv:2509.24527, 2025

  15. [15]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

  16. [16]

Relic: Interactive video world model with long-horizon memory

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

  17. [17]

    AstraNav-World: World Model for Foresight Control and Consistency

    Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xinda Xue, Botao Ren, et al. Astranav-world: World model for foresight control and consistency.arXiv preprint arXiv:2512.21714, 2025

  18. [18]

Vipe: Video pose engine for 3D geometric perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025

  19. [19]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025. 18

  20. [20]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

  21. [21]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  22. [22]

    World Labs. Marble. https://www.worldlabs.ai/blog/marble-world-model, 2025. Accessed: 2026-03-27

  23. [23]

    Vmem: Consistent interactive video scene generation with surfel-indexed view memory

    Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25690–25699, 2025

  24. [24]

Stable video infinity: Infinite-length video generation with error recycling

Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling.arXiv preprint arXiv:2510.09212, 2025

  25. [25]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  26. [26]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision, 2023

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision, 2023

  27. [27]

    Yume-1.5: A text-controlled interactive world generation model,

    Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

  28. [28]

Video generation models in robotics: applications, research challenges, future directions

    Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madison Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, et al. Video generation models in robotics-applications, research challenges, future directions.arXiv preprint arXiv:2601.07823, 2026

  29. [29]

    Sora: Video generation models as world simulators

    OpenAI. Sora: Video generation models as world simulators. https://openai.com/index/ video-generation-models-as-world-simulators/, 2024

  30. [30]

Genie 2: A large-scale foundation world model

J Parker-Holder, P Ball, J Bruce, V Dasagi, K Holsheimer, C Kaplanis, A Moufarek, G Scully, J Shar, J Shi, et al. Genie 2: A large-scale foundation world model. URL: https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model, 2024

  31. [31]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InInternational Conference on Computer Vision, pages 4195–4205, 2023

  32. [32]

Worldplay: Towards long-term geometric consistency for real-time interactive world modeling

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

  33. [34]

    Hunyuan-gamecraft-2: Instruction-following interactive game world model,

    Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Linfeng Zhang, et al. Hunyuan-gamecraft-2: Instruction-following interactive game world model.arXiv preprint arXiv:2511.23429, 2025

  34. [35]

Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3D worlds from words or pixels

    HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, et al. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels.arXiv preprint arXiv:2507.21809, 2025

  35. [36]

Kling-Omni technical report

    Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

  36. [37]

    Advancing open-source world models, 2026

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models, 2026

  37. [38]

    Advancing open-source world models,

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

  38. [39]

    Deep patch visual odometry, 2023

    Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry, 2023

  39. [40]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025. 19

  40. [41]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  41. [42]

Spatialvid: A large-scale video dataset with spatial annotations

    Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025

  42. [43]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  43. [44]

    WorldCompass: Reinforcement learning for long-horizon world models, 2026

    Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, et al. Worldcompass: Reinforcement learning for long-horizon world models.arXiv preprint arXiv:2602.09022, 2026

  44. [45]

Worldmem: Long-term consistent world simulation with memory

    Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

  45. [46]

Matrix-3D: Omnidirectional explorable 3D world generation

    Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li, Mengyin An, Fei Kang, Hua Xue, Baixin Xu, Yuyang Yin, et al. Matrix-3d: Omnidirectional explorable 3d world generation.arXiv preprint arXiv:2508.08086, 2025

  46. [47]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  47. [48]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

  48. [49]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. 2025

  49. [50]

    Context as memory: Scene-consistent interactive long video generation with memory retrieval

    Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

  50. [51]

    Context as memory: Scene-consistent interactive long video generation with memory retrieval

    Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In SIGGRAPH Asia 2025 Conference Papers, pages 19:1–19:11, 2025

  51. [52]

    Gamefactory: Creating new games with generative interactive videos

    Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. InInternational Conference on Computer Vision, 2025

  52. [53]

Mosaicmem: Hybrid spatial memory for controllable video world models

    Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg, et al. Mosaicmem: Hybrid spatial memory for controllable video world models. arXiv preprint arXiv:2603.17117, 2026

  53. [54]

World-in-world: World models in a closed-loop world

    Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M Patel, Paul Pu Liang, et al. World-in-world: World models in a closed-loop world.arXiv preprint arXiv:2510.18135, 2025

  54. [55]

Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models

    Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, and Anyi Rao. Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models.arXiv preprint arXiv:2603.17051, 2026

  55. [56]

    Matrix-game: Interactive world foundation model

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model.arXiv preprint arXiv:2506.18701, 2025

  56. [57]

    Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements, 2025

    Guangcong Zheng, Teng Li, Xianpan Zhou, and Xi Li. Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements, 2025

  57. [58]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

  58. [59]

Omniworld: A multi-domain and multi-modal dataset for 4D world modeling

    Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling.arXiv preprint arXiv:2509.12201, 2025

  59. [60]

Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation.arXiv preprint arXiv:2602.02214, 2026

  60. [61]

    Turbo-vaed: Fast and stable transfer of video-vaes to mobile devices

    Ya Zou, Jingfeng Yao, Siyuan Yu, Shuai Zhang, Wenyu Liu, and Xinggang Wang. Turbo-vaed: Fast and stable transfer of video-vaes to mobile devices. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 14086–14094, 2026. 20