arxiv: 2604.07209 · v2 · submitted 2026-04-08 · 💻 cs.CV

Recognition: 2 theorem links


INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

InSpatio Team (Alphabetical Order): Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao


Pith reviewed 2026-05-10 19:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords 4D world models, spatiotemporal autoregressive modeling, spatial consistency, interactive scene generation, real-time simulation, video-based reconstruction, dynamic environments


The pith

INSPATIO-WORLD uses a spatiotemporal autoregressive architecture to generate high-fidelity 4D interactive scenes in real time from a single video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the challenge of creating world models that support spatial consistency and real-time user interaction for navigating complex environments. Existing video generation methods often lose spatial persistence and realism during long sequences. INSPATIO-WORLD addresses this by evolving scenes autoregressively from one reference video, using a cache to keep a consistent latent representation and constraints to make user inputs produce plausible movements. A distillation process helps keep the output realistic even when trained partly on synthetic data. If successful, this would allow practical exploration of dynamic 4D worlds reconstructed from ordinary videos.

Core claim

INSPATIO-WORLD recovers and generates high-fidelity, dynamic interactive scenes from a single reference video through a Spatiotemporal Autoregressive architecture. This architecture uses an Implicit Spatiotemporal Cache to aggregate reference and historical observations into a latent world representation for global consistency, and an Explicit Spatial Constraint Module to enforce geometric structure and translate user interactions into precise, physically plausible camera trajectories. Joint Distribution Matching Distillation uses real-world data distributions to prevent the fidelity loss caused by over-reliance on synthetic data. Experiments show it outperforms state-of-the-art models in spatial consistency and interaction precision.

What carries the argument

Spatiotemporal Autoregressive (STAR) architecture consisting of an Implicit Spatiotemporal Cache for maintaining latent world representations and an Explicit Spatial Constraint Module for geometric enforcement and interaction handling.
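As a rough illustration of the cache-plus-rollout pattern described above, the Python sketch below keeps a fixed set of reference latents next to a bounded window of previously generated latents and feeds both to a generator at every step. Everything in it is an assumption made for illustration: `RollingCache`, `toy_generator`, and `rollout` are hypothetical names, latents are plain lists of floats, and the averaging "generator" only stands in for the learned model, whose actual cache update rule is not given on this page.

```python
from collections import deque
from typing import Deque, List, Sequence

Latent = List[float]


class RollingCache:
    """Hypothetical cache: reference latents are kept forever, history is bounded."""

    def __init__(self, reference: Sequence[Latent], max_history: int) -> None:
        self.reference: List[Latent] = [list(r) for r in reference]
        # deque(maxlen=...) silently drops the oldest entry once the window is full.
        self.history: Deque[Latent] = deque(maxlen=max_history)

    def context(self) -> List[Latent]:
        """Everything the generator may condition on at the current step."""
        return self.reference + list(self.history)

    def push(self, latent: Latent) -> None:
        self.history.append(list(latent))


def toy_generator(context: Sequence[Latent], camera_delta: Latent) -> Latent:
    """Placeholder for a learned model: average the context, then shift by the camera input."""
    dim = len(camera_delta)
    mean = [sum(c[i] for c in context) / len(context) for i in range(dim)]
    return [m + d for m, d in zip(mean, camera_delta)]


def rollout(cache: RollingCache, camera_deltas: Sequence[Latent]) -> List[Latent]:
    """Autoregressive loop: each new frame is generated from the cache, then added to it."""
    frames: List[Latent] = []
    for delta in camera_deltas:
        frame = toy_generator(cache.context(), delta)
        cache.push(frame)
        frames.append(frame)
    return frames


if __name__ == "__main__":
    cache = RollingCache(reference=[[0.0, 0.0]], max_history=2)
    for frame in rollout(cache, [[1.0, 0.0]] * 3):
        print(frame)
```

The design point the sketch tries to capture is that the reference entries are never evicted while the history window slides, so every step still sees the original observation no matter how long the rollout runs.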

If this is right

Real-time navigation in 4D environments becomes possible using only monocular video input.
Global consistency is maintained over long-horizon scene generations without external references.
User interactions translate directly into physically plausible trajectories.
Realism is preserved through regularization against real data distributions despite synthetic training components.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such a system could lower the barrier for creating interactive simulations in fields like robotics or gaming by relying on readily available video footage.
The cache mechanism might inspire similar consistency-preserving techniques in other sequential generation tasks.
Testing on diverse real-world videos beyond the benchmark could reveal the limits of the spatial consistency claims.

Load-bearing premise

The Implicit Spatiotemporal Cache and Explicit Spatial Constraint Module can together preserve global consistency and physical plausibility in trajectories over long time horizons without losing visual fidelity.

What would settle it

A long navigation sequence generated by the model where object positions drift or geometries become inconsistent with the reference video, or where user-controlled camera paths produce non-physical results.

Figures

Figures reproduced from arXiv: 2604.07209 by InSpatio Team (Alphabetical Order): Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao.

**Figure 1.** INSPATIO-WORLD: Toward a Versatile 4D World Simulator. Top: Our framework enables the synthesis of diverse dynamic scenes from a single video, supporting real-time, high-DoF interactive 4D roaming experiences. Middle: The system is driven by three core capabilities: Free Spatial Roaming along user-defined camera trajectories, Temporal Control over dynamic scene evolution, and the maintenance of Physical …

**Figure 2.** Architecture of the Spatiotemporal Autoregressive Framework and JDMD Pipeline. The framework constructs a spatiotemporal cache using reference information and historical generations, leveraging depth-based warping to establish explicit geometric constraints for consistent autoregressive video generation. The JDMD phase features a multi-task distillation mechanism with shared weights, supervised by a dual-t…

**Figure 3.** Quantitative comparison on WorldScore-Dynamic. Each bubble represents a method, with the vertical axis showing the WorldScore-Dynamic score and the horizontal axis showing model parameters × inference steps. INSPATIO-WORLD achieves a dynamic score of 68.72 with a significantly lower computational overhead, demonstrating a superior compute-quality trade-off by breaking the zero-sum game between geometric…

**Figure 4.** Qualitative comparison on the RE10K-Long dataset. For each of the two scenes, the leftmost image is the input source image. For each method, the top row displays the intermediate frame of the generated sequence, while the bottom row shows the final frame. As generation progresses, baseline methods exhibit varying degrees of failure, such as camera pose drift or…

**Figure 5.** Qualitative comparison on Camera Controlled Video Rerendering. Each row represents a distinct scene. From left to right: the first frame of the reference video, the warped final frame, and the final frames generated by TrajectoryCrafter, ReCamMaster, NeoVerse, and our method. Compared to existing methods, our approach yields higher structural fidelity to the original scene and delivers significantly bette…
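The Figure 2 caption mentions depth-based warping as the source of explicit geometric constraints. A minimal sketch of the standard pinhole version of that operation is given below: a pixel with known depth is unprojected to 3D, moved by a rigid transform into the target camera frame, and projected again. The function name `warp_pixel`, the shared intrinsics, and the source-to-target pose convention are assumptions for illustration; the paper's actual warping pipeline may differ.

```python
from typing import Optional, Sequence, Tuple


def warp_pixel(
    u: float,
    v: float,
    depth: float,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> Optional[Tuple[float, float, float]]:
    """Reproject source pixel (u, v) with the given depth into a target view.

    rotation (3x3) and translation (3) map source-camera coordinates to
    target-camera coordinates. Both views are assumed to share intrinsics.
    Returns (u', v', depth') in the target view, or None if the point ends
    up behind the target camera.
    """
    # 1. Unproject: pixel + depth -> 3D point in the source camera frame.
    point = (
        (u - cx) / fx * depth,
        (v - cy) / fy * depth,
        depth,
    )
    # 2. Rigid transform into the target camera frame.
    xt, yt, zt = (
        sum(rotation[row][k] * point[k] for k in range(3)) + translation[row]
        for row in range(3)
    )
    if zt <= 0.0:
        return None
    # 3. Project back onto the target image plane.
    return (fx * xt / zt + cx, fy * yt / zt + cy, zt)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(warp_pixel(320.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0,
                     identity, [0.1, 0.0, 0.0]))
```

Applying this to every pixel of a reference frame yields a warped image with holes where geometry is occluded or unseen, which is the kind of "warped final frame" shown as a baseline column in Figure 5.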

read the original abstract

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

INSPATIO-WORLD proposes a STAR architecture for real-time 4D simulation but the evidence for sustained long-horizon consistency is missing.


This paper puts forward INSPATIO-WORLD as a real-time 4D world simulator built from monocular video. Its main novelty is the Spatiotemporal Autoregressive (STAR) setup that uses an Implicit Spatiotemporal Cache to keep a latent representation of the world and an Explicit Spatial Constraint Module to turn user-driven camera moves into physically plausible paths. They also add Joint Distribution Matching Distillation to pull the model toward real data distributions and reduce reliance on synthetic training.

The approach makes sense for fixing the usual problems in video generation, such as losing spatial structure over time and poor support for interactive control. Describing how the cache aggregates observations for global consistency and how the constraint module enforces geometry is a reasonable way to structure the system. The paper does a good job laying out the motivation and the high-level design. If the components work together as claimed, it could offer a usable pipeline for reconstructing navigable scenes.

The main weakness is in the experimental backing. The abstract asserts top ranking on the WorldScore-Dynamic benchmark and better spatial consistency than prior methods, but it gives no actual numbers or tables. There are also no ablations showing what each part contributes. The stress-test note is accurate: we see no plots or tables of error growth with longer navigation or more autoregressive steps. Without that, the claim of maintaining consistency over long horizons rests on the architecture description alone rather than data. The full paper may contain these details, but from the provided information the results section looks thin for the scale of the claims.

This kind of work is useful for computer vision researchers focused on world models and simulation. Readers who follow video generation and 4D reconstruction will see the architectural choices clearly. It is less valuable for those who require strong empirical validation of long-term behavior.

I recommend sending it for peer review. The topic is relevant, the proposed modules are specific, and the referees can require the necessary scaling experiments and quantitative comparisons. It is worth the effort to get feedback on how to strengthen the evaluation.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces INSPATIO-WORLD, a real-time 4D world simulator that recovers and generates high-fidelity dynamic interactive scenes from a single reference video. Its core is a Spatiotemporal Autoregressive (STAR) architecture with two components: an Implicit Spatiotemporal Cache that aggregates reference and historical observations into a latent world representation for global consistency, and an Explicit Spatial Constraint Module that enforces geometric structure and translates user interactions into physically plausible camera trajectories. In addition, the authors introduce Joint Distribution Matching Distillation (JDMD), which uses real-world data distributions to counteract fidelity degradation from synthetic data. The central claim is that the method significantly outperforms existing SOTA models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark.

Significance. If the long-horizon consistency and benchmark superiority claims are substantiated with quantitative evidence, the work would constitute a meaningful step toward practical real-time interactive 4D world models from monocular video, with potential utility in robotics, VR/AR, and simulation. The JDMD regularization approach and the coupling of implicit caching with explicit spatial constraints represent potentially reusable ideas for mitigating drift in autoregressive generation.

major comments (2)

[Abstract / Experiments] Abstract and Experiments: The central claim that the Implicit Spatiotemporal Cache (coupled with the Explicit Spatial Constraint Module) maintains global spatial consistency and produces physically plausible trajectories over long horizons without fidelity loss is load-bearing, yet the manuscript supplies no quantitative scaling analysis. No metrics such as spatial error, reprojection consistency, or trajectory drift are reported as functions of increasing autoregressive steps, navigation length, or video duration on WorldScore-Dynamic or any other benchmark.
[Abstract] Abstract: The assertion of benchmark superiority and first-place ranking among real-time interactive methods is stated without any numerical results, error bars, ablation tables, or comparison details (e.g., exact WorldScore-Dynamic scores versus prior methods). This absence prevents assessment of effect size or whether gains are driven by short-horizon test cases.

minor comments (2)

[Abstract] The abstract introduces three new named components (Implicit Spatiotemporal Cache, Explicit Spatial Constraint Module, Joint Distribution Matching Distillation) without a concise one-sentence definition or pointer to the corresponding section for each.
[Methods] Notation for the STAR architecture and cache update rules should be introduced with a single equation or diagram reference early in the methods to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas where additional quantitative evidence can strengthen the presentation of our long-horizon consistency claims and benchmark results. We address each major comment below and will revise the manuscript accordingly to incorporate the requested analyses and numerical details.


Referee: [Abstract / Experiments] Abstract and Experiments: The central claim that the Implicit Spatiotemporal Cache (coupled with the Explicit Spatial Constraint Module) maintains global spatial consistency and produces physically plausible trajectories over long horizons without fidelity loss is load-bearing, yet the manuscript supplies no quantitative scaling analysis. No metrics such as spatial error, reprojection consistency, or trajectory drift are reported as functions of increasing autoregressive steps, navigation length, or video duration on WorldScore-Dynamic or any other benchmark.

Authors: We agree that an explicit scaling analysis would better substantiate the long-horizon claims. The current manuscript reports aggregate performance metrics and qualitative results across navigation sequences but does not plot or tabulate spatial error, reprojection consistency, or trajectory drift as functions of autoregressive steps or video duration. In the revision we will add a dedicated scaling study in the Experiments section, including these metrics evaluated on WorldScore-Dynamic for increasing horizons (e.g., 50, 100, 200 steps) with corresponding figures and tables. revision: yes
Referee: [Abstract] Abstract: The assertion of benchmark superiority and first-place ranking among real-time interactive methods is stated without any numerical results, error bars, ablation tables, or comparison details (e.g., exact WorldScore-Dynamic scores versus prior methods). This absence prevents assessment of effect size or whether gains are driven by short-horizon test cases.

Authors: The full manuscript contains comparison tables in Section 4 that report exact WorldScore-Dynamic scores, standard deviations, and ablations against prior real-time methods. The abstract currently summarizes the outcome without these numbers. We will revise the abstract to include the key quantitative results (top score and margins versus the next-best real-time baseline) while retaining the overall claim, thereby allowing readers to assess effect size directly from the abstract. revision: yes
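The scaling analysis discussed in the first exchange amounts to measuring an error per autoregressive step and summarizing it at increasing horizons. The Python sketch below shows one way that could be tabulated for camera-position drift at the 50, 100, and 200 step horizons mentioned in the response. The function names `per_step_drift` and `drift_at_horizons` are hypothetical, and the trajectory in the usage example is synthetic data invented for illustration, not a result from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple

Position = Tuple[float, float, float]


def per_step_drift(predicted: Sequence[Position], target: Sequence[Position]) -> List[float]:
    """Euclidean distance between predicted and target camera positions at each step."""
    if len(predicted) != len(target):
        raise ValueError("trajectories must have the same number of steps")
    return [math.dist(p, t) for p, t in zip(predicted, target)]


def drift_at_horizons(errors: Sequence[float], horizons: Sequence[int]) -> Dict[int, float]:
    """Mean drift over the first h steps, for every horizon h the rollout is long enough for."""
    return {h: sum(errors[:h]) / h for h in horizons if 0 < h <= len(errors)}


if __name__ == "__main__":
    # Synthetic example: the predicted camera slides sideways by 0.01 units per step.
    target = [(float(i), 0.0, 0.0) for i in range(200)]
    predicted = [(float(i), 0.01 * i, 0.0) for i in range(200)]
    errors = per_step_drift(predicted, target)
    for horizon, mean_error in drift_at_horizons(errors, (50, 100, 200)).items():
        print(f"{horizon:>4} steps: mean drift {mean_error:.3f}")
```

A curve that stays flat as the horizon grows would support the long-horizon consistency claim; a curve that rises with the horizon, as in this synthetic example, would indicate accumulating drift.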

Circularity Check

0 steps flagged

No significant circularity in architecture or benchmark claims


The paper introduces a Spatiotemporal Autoregressive (STAR) architecture with described components (Implicit Spatiotemporal Cache, Explicit Spatial Constraint Module, JDMD) and reports empirical outperformance on the external WorldScore-Dynamic benchmark. No equations, parameter fits, or derivations are shown that reduce by construction to the target metrics or self-referential definitions. Claims rest on experimental results rather than self-citation chains, uniqueness theorems, or renamed known patterns. The central consistency assertions are presented as design goals validated by benchmarks, not forced by internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The paper introduces multiple new named components and relies on standard deep-learning assumptions about autoregressive modeling and data distillation; full text would likely reveal many neural-network hyperparameters as free parameters. No machine-checked proofs or external independent benchmarks are mentioned.

axioms (2)

domain assumption: Spatiotemporal autoregressive models can maintain global consistency across long-horizon navigation.
Invoked as the basis for the Implicit Spatiotemporal Cache component.
domain assumption: Real-world data distributions can serve as an effective regularizer for synthetic generation via distillation.
Core premise of the Joint Distribution Matching Distillation technique.

invented entities (3)

Implicit Spatiotemporal Cache (no independent evidence)
purpose: Aggregates reference and historical observations into a latent world representation to ensure global consistency.
New component introduced to address spatial persistence in navigation.
Explicit Spatial Constraint Module (no independent evidence)
purpose: Enforces geometric structure and converts user interactions into physically plausible camera trajectories.
New module proposed to improve interaction precision.
Joint Distribution Matching Distillation (JDMD) (no independent evidence)
purpose: Uses real-world data distributions to mitigate fidelity degradation from synthetic training data.
New distillation method introduced to improve visual realism.
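The second axiom, that a real-data distribution can regularize a generator through distillation, can be made concrete with a one-dimensional toy. The sketch below is not JDMD, whose details are not given on this page. It only illustrates the general distribution-matching principle under strong assumptions: the generator and the real data are both Gaussians with a shared, known standard deviation, their score functions are available in closed form, and the generator mean is updated with the difference between the generator's score and the real score, which is the gradient of the reverse KL divergence with respect to that mean. All names (`gaussian_score`, `distribution_matching_step`) are hypothetical.

```python
import random


def gaussian_score(x: float, mean: float, std: float) -> float:
    """Score function d/dx log N(x; mean, std^2) of a 1-D Gaussian."""
    return -(x - mean) / (std * std)


def distribution_matching_step(
    gen_mean: float,
    real_mean: float,
    std: float,
    lr: float,
    n_samples: int,
    rng: random.Random,
) -> float:
    """One gradient step on KL(generator || real) with respect to the generator mean.

    Samples are drawn as x = gen_mean + std * z, so dx/d(gen_mean) = 1 and the
    gradient estimate is the average of (generator score - real score) at x.
    """
    grad = 0.0
    for _ in range(n_samples):
        x = gen_mean + std * rng.gauss(0.0, 1.0)
        grad += gaussian_score(x, gen_mean, std) - gaussian_score(x, real_mean, std)
    grad /= n_samples
    return gen_mean - lr * grad


if __name__ == "__main__":
    rng = random.Random(0)
    mean = 5.0  # generator starts far from the "real" distribution centred at 0
    for step in range(20):
        mean = distribution_matching_step(mean, real_mean=0.0, std=1.0,
                                          lr=0.5, n_samples=8, rng=rng)
    print(f"generator mean after 20 steps: {mean:.6f}")
```

In practical distillation setups both scores come from learned networks rather than closed-form expressions, so the toy shows only the direction of the update, not the training procedure.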

pith-pipeline@v0.9.0 · 5637 in / 1725 out tokens · 98110 ms · 2026-05-10T19:02:39.285124+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"Spatiotemporal Autoregressive (STAR) architecture... Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation... Explicit Spatial Constraint Module enforces geometric structure"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"long-horizon navigation... spatial consistency... 24 FPS real-time"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
cs.CV 2026-05 unverdicted novelty 5.0

SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

Reference graph

Works this paper leans on

110 extracted references · 56 canonical work pages · cited by 1 Pith paper · 18 internal anchors

[1]

Block diffusion: Interpolating between autoregressive and diffusion language models

Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InInternational Conference on Learning Representations (ICLR), 2025

2025
[2]

arXiv preprint arXiv:2407.12781 , year=

Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control.arXiv preprint arXiv:2407.12781, 2024

work page arXiv 2024
[3]

Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22875–22889, 2025

2025
[4]

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video.IEEE/CVF International Conference on Computer Vision (ICCV), 2025

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, and Di Zhang. ReCamMaster: Camera-Controlled Generative Rendering from A Single Video.IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[5]

Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Mar- jorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, and Jessica Yu...

2025
[6]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15791–15801, 2025

2025
[7]

GS-DiT: Ad- vancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, and Hongsheng Li. GS-DiT: Ad- vancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking. arXiv preprint arXiv:2501.02690, 2025

work page arXiv 2025
[8]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review arXiv 2023
[9]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023
[10]

TAEHV: Tiny AutoEncoder for Hunyuan Video.https://github.com/ madebyollin/taehv, 2025

Ollin Boer Bohan. TAEHV: Tiny AutoEncoder for Hunyuan Video.https://github.com/ madebyollin/taehv, 2025

2025
[11]

Video generation models as world simulators, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024. URLhttps://openai.com/research/ video-generation-models-as-world-simulators

2024
[12]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InInt. Conf. Mach. Learn., 2024

2024
[13]

MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Chenjie Cao, Chaohui Yu, Shang Liu, Fan Wang, Xiangyang Xue, and Yanwei Fu. MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 6045–6056, 2025. 14

2025
[14]

Dif- fusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Dif- fusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[15]

TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model, 2025

Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng, Zixiao Gu, Yuyang Huang, Zicheng Jiang, Wei Li, Tian Li, et al. TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model, 2025

2025
[16]

PostCam: Camera-Controllable Novel-View Video Generation with Query- Shared Cross-Attention.arXiv preprint arXiv:2511.17185, 2025

Yipeng Chen, Zhichao Ye, Zhenzhou Fang, Xinyu Chen, Xiaoyu Zhang, Jialing Liu, Nan Wang, Haomin Liu, and Guofeng Zhang. PostCam: Camera-Controllable Novel-View Video Generation with Query- Shared Cross-Attention.arXiv preprint arXiv:2511.17185, 2025

work page arXiv 2025
[17]

FantasyWorld: Geometry-consistent world modeling via unified video and 3d prediction.arXiv preprint arXiv:2509.21657,

Yixiang Dai, Fan Jiang, Chiyu Wang, Mu Xu, and Yonggang Qi. Fantasyworld: Geometry-consistent world modeling via unified video and 3d prediction.arXiv preprint arXiv:2509.21657, 2025

work page arXiv 2025
[18]

arXiv preprint arXiv:2412.12095 , year=

Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, and Haoqi Fan. Causal diffusion transformers for generative modeling.arXiv preprint arXiv:2412.12095, 2024

work page arXiv 2024
[19]

Autoregressive Video Generation without Vector Quantization

Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yong- gang Qi, and Xinlong Wang. Autoregressive Video Generation without Vector Quantization. InInterna- tional Conference on Learning Representations (ICLR), 2025

2025
[20]

WorldScore: A unified evaluation benchmark for world generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. WorldScore: A unified evaluation benchmark for world generation. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 27713–27724, 2025

2025
[21]

arXiv preprint arXiv:2411.06525 , year=

Wanquan Feng, Jiawei Liu, Pengqi Tu, Tianhao Qi, Mingzhen Sun, Tianxiang Ma, Songtao Zhao, Siyu Zhou, and Qian He. I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength. arXiv preprint arXiv:2411.06525, 2024

work page arXiv 2024
[22]

arXiv preprint arXiv:2411.16375 (2024)

Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, and Long Chen. Ca2-VDM: Effi- cient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing.arXiv preprint arXiv:2411.16375, 2024

work page arXiv 2024
[23]

VGGT: Visual Geometry Grounded Transformer for One-Shot 3D Reconstruction.arXiv preprint arXiv:2512.xxxxx, 2025

Juan Garrido, Jeremy Reizenstein, Ignacio Rocco, Andrea Vedaldi, et al. VGGT: Visual Geometry Grounded Transformer for One-Shot 3D Reconstruction.arXiv preprint arXiv:2512.xxxxx, 2025

2025
[24]

Long-context autoregressive video modeling with next-frame prediction

Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-Context Autoregressive Video Modeling with Next- Frame Prediction.arXiv preprint arXiv:2503.19325, 2025

work page arXiv 2025
[25]

arXiv preprint arXiv:2501.03847 (2025)

Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, et al. Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control. arXiv preprint arXiv:2501.03847, 2025

work page arXiv 2025
[26]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review arXiv 2023
[27]

Long context tuning for video generation

Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation.arXiv preprint arXiv:2503.10589, 2025

work page arXiv 2025
[28]

Photorealistic video generation with diffusion models

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. InProceedings of the European Conference on Computer Vision (ECCV), 2024

2024
[29]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

work page internal anchor Pith review arXiv 2024
[30]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Camerac- trl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024. 15

work page internal anchor Pith review arXiv 2024
[31]

Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models.arXiv preprint arXiv:2503.10592, 2025

Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffu- sion models.arXiv preprint arXiv:2503.10592, 2025

work page arXiv 2025
[32]

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen Video: High Definition Video Generation with Diffusion Models. ArXiv, abs/2210.02303, 2022. URL https://api.semanticscholar.org/CorpusID:252715883
[34]

Video diffusion models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022
[35]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. In International Conference on Learning Representations (ICLR), 2023
[36]

Training-free camera control for video generation

Chen Hou, Guoqiang Wei, Yan Zeng, and Zhibo Chen. Training-free camera control for video generation. arXiv preprint arXiv:2406.10126, 2024
[37]

ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, and Maosong Sun. ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer. arXiv preprint arXiv:2412.07720, 2024
[38]

Motionmaster: Training-free camera motion transfer for video generation

Teng Hu, Jiangning Zhang, Ran Yi, Yating Wang, Hongrui Huang, Jieyu Weng, Yabiao Wang, and Lizhuang Ma. Motionmaster: Training-free camera motion transfer for video generation. arXiv preprint arXiv:2404.15789, 2024
[39]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self-Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion. arXiv preprint arXiv:2506.08009, 2025
[40]

VBench: Comprehensive Benchmark Suite for Video Generation

Zanyi Huang, Haoxin He, Chao Jiang, Cuicui Luan, Kai Wang, Xingzhe Wang, Zehuan Yuan, and Ziwei Liu. VBench: Comprehensive Benchmark Suite for Video Generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[41]

Pyramidal Flow Matching for Efficient Video Generative Modeling

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal Flow Matching for Efficient Video Generative Modeling. In International Conference on Learning Representations (ICLR), 2025
[42]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023
[43]

FIFO-Diffusion: Generating Infinite Videos from Text without Training

Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. FIFO-Diffusion: Generating Infinite Videos from Text without Training. In Advances in Neural Information Processing Systems (NeurIPS), 2024
[44]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. VideoPoet: A Large Language Model for Zero-Shot Video Generation. In Int. Conf. Mach. Learn., 2024
[45]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024
[46]

Collaborative video diffusion: Consistent multi-video generation with camera control

Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas J Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems, 37:16240–16271, 2024
[47]

Mirage 2

World Labs. Mirage 2. https://www.mirage2.org/, 2025. Accessed: 2026-03-11.
[48]

RealCam-I2V: Real-world image-to-video generation with interactive complex camera control

Teng Li, Guangcong Zheng, Rui Jiang, Tao Wu, Yehao Lu, Yining Lin, Xi Li, et al. RealCam-I2V: Real-world image-to-video generation with interactive complex camera control. arXiv preprint arXiv:2502.10059, 2025
[49]

Autoregressive image generation without vector quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In Advances in Neural Information Processing Systems (NeurIPS), 2024
[50]

Arlon: Boosting diffusion transformers with autoregressive models for long video generation

Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation. In International Conference on Learning Representations (ICLR), 2025
[51]

Wonderland: Navigating 3d scenes from a single image

Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3d scenes from a single image. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 798–810, 2025
[52]

LTX-Video: A DiT-based Video Generation Model

Lightricks. LTX-Video: A DiT-based Video Generation Model. https://github.com/Lightricks/LTX-Video, 2024
[53]

Diffusion adversarial post-training for one-step video generation

Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316, 2025
[54]

Motionclone: Training-free motion cloning for controllable video generation

Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, and Yi Jin. Motionclone: Training-free motion cloning for controllable video generation. arXiv preprint arXiv:2406.05338, 2024
[55]

Mardini: Masked autoregressive diffusion for video generation at scale

Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, et al. Mardini: Masked autoregressive diffusion for video generation at scale. arXiv preprint arXiv:2410.20280, 2024
[56]

Redefining temporal modeling in video diffusion: The vectorized timestep approach

Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H Chan, and Jean-Michel Morel. Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach. arXiv preprint arXiv:2410.03160, 2024
[57]

Autoregressive diffusion transformer for text-to-speech synthesis

Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, and Haizhou Li. Autoregressive diffusion transformer for text-to-speech synthesis.arXiv preprint arXiv:2406.05551, 2024

[58]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jianzhong Wang, and Hang Zhao. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference. arXiv preprint arXiv:2310.04378, 2023. URL https://arxiv.org/abs/2310.04378
[59]

Osv: One step is enough for high-quality image to video generation

Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang, and Wenhan Luo. Osv: One step is enough for high-quality image to video generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
[60]

Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

Mark YU, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. arXiv preprint arXiv:2503.05638, 2025
[61]

NeRF: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021
[62]

Hailuo

MiniMax. Hailuo. https://hailuoai.video/, 2024
[63]

X-Fusion: Introducing New Modality to Frozen Large Language Models

Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, et al. X-Fusion: Introducing New Modality to Frozen Large Language Models. arXiv preprint arXiv:2504.20996, 2025
[64]

Multidiff: Consistent novel view synthesis from a single image

Norman Müller, Katja Schwarz, Barbara Rössle, Lorenzo Porzi, Samuel Rota Bulò, Matthias Nießner, and Peter Kontschieder. Multidiff: Consistent novel view synthesis from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10258–10268, 2024
[65]

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024
[66]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024
[67]

CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control

Stefan Popov, Amit Raj, Michael Krainin, Yuanzhen Li, William T Freeman, and Michael Rubinstein. CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control. arXiv preprint arXiv:2501.06006, 2025
[68]

Next Block Prediction: Video Generation via Semi-Auto-Regressive Modeling

Shuhuai Ren, Shuming Ma, Xu Sun, and Furu Wei. Next Block Prediction: Video Generation via Semi-Auto-Regressive Modeling. arXiv preprint arXiv:2502.07737, 2025
[69]

Gen3c: 3d-informed world-consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6121–6132, 2025
[70]

Rolling diffusion models

David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. In Int. Conf. Mach. Learn., 2024
[71]

Gen-3 Alpha: High-Fidelity Video Generation

Runway. Gen-3 Alpha: High-Fidelity Video Generation. https://runwayml.com/research/gen-3-alpha, 2024
[72]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models. In International Conference on Learning Representations (ICLR), 2022
[73]

MAGI-1: Autoregressive Video Generation at Scale

Sand-AI. MAGI-1: Autoregressive Video Generation at Scale, 2025. URL https://static.magi.world/static/files/MAGI_1.pdf
[74]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022
[75]

AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion

Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
[76]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling. arXiv preprint arXiv:2512.14614, 2025
[77]

InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

InSpatio Team. InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model. arXiv preprint arXiv:2603.11911, 2026
[78]

Advancing Open-source World Models

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing Open-source World Models. arXiv preprint arXiv:2601.20540, 2026
[79]

Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions

R Villegas, H Moraldo, S Castro, M Babaeizadeh, H Zhang, J Kunze, PJ Kindermans, MT Saffar, and D Erhan. Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions. In International Conference on Learning Representations (ICLR), 2023
[80]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

Showing first 80 references.