DisCo: World Models with Discrete Camera Motion Control

Hongrui Huang; Junke Wang; Quanhao Li; Yu-Gang Jiang; Zuxuan Wu

arxiv: 2606.07967 · v1 · pith:EPOZIZZUnew · submitted 2026-06-06 · 💻 cs.CV

Pith reviewed 2026-06-27 20:19 UTC · model grok-4.3

keywords controllable video generation · world models · discrete action primitives · camera motion control · action separability · video synthesis · DisCoBench

The pith

Discrete camera motion primitives replace continuous trajectories to improve action controllability in video world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most controllable video world models condition generation on continuous camera trajectories, but this produces high feature similarity across different motion patterns and weakens reliable action following. The paper identifies this entanglement as the core bottleneck and replaces it with conditioning on a compact set of discrete action primitives. The resulting model, DisCo, is evaluated on DisCoBench, which tests short-term, long-horizon, and highly dynamic camera control. Experiments show that the discrete representation yields more faithful execution of explicit commands while preserving visual quality and temporal coherence.

Core claim

Continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Conditioning generation instead on a compact set of discrete action primitives improves action separability, enabling significantly more reliable action following in controllable video generation while preserving visual quality.
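To make the entanglement diagnosis concrete, here is a minimal sketch of how feature similarity across motion patterns could be measured. It assumes cosine similarity as the metric (the simulated rebuttal further down mentions it; the abstract does not), and the feature vectors, motion names, and the helper `mean_pairwise_similarity` are invented for illustration rather than taken from the paper.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(features):
    """Average cosine similarity over all pairs of motion features.

    `features` maps a motion name to its feature vector.
    """
    pairs = list(combinations(features.values(), 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy vectors (hypothetical, not from the paper): nearly parallel features
# stand in for an entangled continuous representation.
entangled = {
    "pan_left": [1.0, 0.9, 0.1],
    "pan_right": [0.9, 1.0, 0.1],
    "forward": [1.0, 1.0, 0.2],
}

# Orthogonal vectors stand in for a well-separated discrete representation.
separable = {
    "pan_left": [1.0, 0.0, 0.0],
    "pan_right": [0.0, 1.0, 0.0],
    "forward": [0.0, 0.0, 1.0],
}

entangled_score = mean_pairwise_similarity(entangled)
separable_score = mean_pairwise_similarity(separable)
```

In this toy setup the entangled score is close to 1 and the separable score is 0; a lower mean pairwise similarity is what "improved action separability" would look like under this assumed metric.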

What carries the argument

DisCo conditions video generation on a compact set of discrete action primitives chosen to increase separability between distinct camera motions.
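As a rough illustration of what conditioning on discrete primitives means, the sketch below encodes a command sequence as one-hot vectors. The primitive names in `CameraPrimitive` and the one-hot encoding are assumptions made for this example; the abstract does not list the actual primitive set or say how DisCo embeds it, and a real model would more likely use learned embeddings.

```python
from enum import Enum


class CameraPrimitive(Enum):
    # Hypothetical primitive set, chosen only for illustration.
    FORWARD = 0
    BACKWARD = 1
    PAN_LEFT = 2
    PAN_RIGHT = 3
    TILT_UP = 4
    TILT_DOWN = 5


def one_hot(primitive):
    """Return a one-hot vector with a 1.0 at the primitive's index."""
    vec = [0.0] * len(CameraPrimitive)
    vec[primitive.value] = 1.0
    return vec


def encode_sequence(primitives):
    """Encode a list of primitives as a list of one-hot condition vectors."""
    return [one_hot(p) for p in primitives]


commands = [
    CameraPrimitive.FORWARD,
    CameraPrimitive.FORWARD,
    CameraPrimitive.PAN_LEFT,
]
condition = encode_sequence(commands)
```

Distinct primitives map to orthogonal vectors by construction, which is the simplest way to see why a discrete action space can be more separable than nearby points on a continuous trajectory.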

If this is right

Explicit action commands become more faithfully executed across complex and extended motion sequences.
Action separability improves without trade-off in visual quality or coherence.
Standardized testing of controllability becomes possible via DisCoBench for short-term, long-horizon, and dynamic cases.
World models can support interactive exploration with explicit discrete controls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same discretization principle might reduce action-space complexity in other generative control tasks.
Hierarchical or learned primitives could become necessary if the fixed set proves insufficient for certain domains.
Training efficiency may increase because the model no longer needs to distinguish near-identical continuous trajectories.

Load-bearing premise

A small fixed set of discrete primitives can express the full range of complex, long-horizon camera motions required for realistic world exploration.

What would settle it

A demonstration that some realistic long camera sequence cannot be formed from the discrete primitives without measurable loss in visual fidelity or temporal coherence would falsify the central claim.
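One way such a test could be set up is to snap each step of a continuous trajectory to its nearest primitive and look at the residual. The sketch below does this under strong simplifying assumptions: motion is reduced to (forward translation, yaw change) per step, the `PRIMITIVES` table and its step sizes are hypothetical, and the Euclidean distance mixes units, so the numbers are illustrative only. It measures geometric coverage, not the visual fidelity or temporal coherence the paragraph above refers to.

```python
import math

# Hypothetical primitives: (forward translation, yaw change in degrees) per step.
PRIMITIVES = {
    "forward": (1.0, 0.0),
    "backward": (-1.0, 0.0),
    "pan_left": (0.0, 5.0),
    "pan_right": (0.0, -5.0),
    "stay": (0.0, 0.0),
}


def nearest_primitive(step):
    """Name of the primitive closest to a continuous (translation, yaw) step."""
    return min(PRIMITIVES, key=lambda name: math.dist(PRIMITIVES[name], step))


def quantize(trajectory):
    """Map each continuous step to a primitive; return names and mean residual."""
    names = [nearest_primitive(step) for step in trajectory]
    errors = [math.dist(PRIMITIVES[n], s) for n, s in zip(names, trajectory)]
    return names, sum(errors) / len(errors)


# A toy trajectory whose last step is a diagonal move that no primitive matches well.
trajectory = [(1.0, 0.0), (0.9, 0.5), (0.0, 4.0), (0.6, 2.0)]
names, mean_error = quantize(trajectory)
```

A trajectory built only from primitive steps yields zero residual, while steps such as the diagonal one leave a large residual; a consistently large residual on realistic trajectories would point to an expressiveness gap.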

Original abstract

Controllable video world models target interactive world exploration, where models must faithfully execute explicit action commands while preserving visual quality and temporal coherence. However, most existing approaches rely on continuous camera trajectories as action conditions, which often lead to unreliable action following, especially under complex motion sequences. In this work, we identify action representation entanglement as a key bottleneck in controllable video generation, and show that continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Based on this insight, we propose DisCo, a controllable video world model that conditions generation on a compact set of discrete action primitives to improve action separability. We further introduce DisCoBench, a comprehensive benchmark for evaluating the ability of models in short-term, long-horizon, and highly dynamic exploration scenarios. Extensive experiments demonstrate that DisCo achieves significantly more reliable action following while preserving visual quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The key point is that continuous camera trajectories cause feature entanglement that hurts controllability; discrete primitives are offered as the fix, along with a new benchmark.

The key point is that this paper claims continuous camera trajectories cause high feature similarity across distinct motion patterns, which degrades action controllability in video world models, and that discrete action primitives solve it by improving separability. They introduce DisCo based on this insight along with DisCoBench for testing short-term, long-horizon, and highly dynamic cases.

The work is new in its specific framing of discrete primitives as the solution to the entanglement problem in camera-conditioned models.

It does well in highlighting a practical issue and providing a benchmark that goes beyond basic scenarios to include complex exploration.

The experiments are described as demonstrating more reliable action following with maintained visual quality.

Soft spots include the need to verify how the discrete set was selected and whether it is expressive enough for long-horizon motions, as the abstract gives no numbers on the feature similarity or the coverage of the primitives.

The argument itself is internally consistent with no circularity.

This paper is aimed at researchers in computer vision focused on controllable video generation and world models.

Readers dealing with action control in simulation or robotics applications would get value from the benchmark and the proposed method, assuming the results are robust.

It deserves a serious referee because it tackles a relevant bottleneck with a new approach and evaluation tool.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies action representation entanglement in continuous camera trajectories as a key bottleneck for controllable video world models, where distinct motion patterns exhibit high feature similarity that degrades action controllability. It proposes DisCo, a model conditioned on a compact set of discrete action primitives to improve separability, introduces the DisCoBench benchmark covering short-term, long-horizon, and highly dynamic scenarios, and reports that experiments demonstrate significantly more reliable action following while preserving visual quality and temporal coherence.

Significance. If the empirical results hold under rigorous controls, the work would provide a concrete demonstration that discrete primitives can mitigate entanglement issues in action-conditioned video generation, offering a practical alternative for interactive world models. The introduction of DisCoBench as a standardized evaluation suite for long-horizon camera control would also be a useful community resource, particularly if the benchmark protocols are reproducible and falsifiable.

major comments (2)

[DisCoBench and Experiments sections] The central claim that discrete primitives remain sufficiently expressive for complex, long-horizon camera motions (the weakest assumption noted in the stress-test) is load-bearing but receives no quantitative coverage analysis or ablation in the provided description of DisCoBench or the experimental protocol; without this, it is unclear whether the reported gains generalize beyond the tested scenarios.
[Introduction / Motivation] The identification of 'high feature similarity across distinct motion patterns' in continuous representations is asserted as the bottleneck, yet the abstract and summary provide no derivation, similarity metric, or quantitative comparison (e.g., cosine similarity or mutual information between features) to support the entanglement diagnosis before proposing the discrete fix.

minor comments (2)

Notation for the discrete action primitives and how they are selected or learned should be clarified with a formal definition or pseudocode to aid reproducibility.
[Abstract] The abstract states 'extensive experiments' but does not preview key metrics (e.g., action accuracy, FID, or temporal consistency scores) or baseline comparisons; adding one sentence would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate revisions where appropriate to strengthen the manuscript.

Referee: The identification of 'high feature similarity across distinct motion patterns' in continuous representations is asserted as the bottleneck, yet the abstract and summary provide no derivation, similarity metric, or quantitative comparison (e.g., cosine similarity or mutual information between features) to support the entanglement diagnosis before proposing the discrete fix.

Authors: The full manuscript provides a quantitative analysis of feature similarity (using cosine similarity) between distinct motion patterns under continuous representations in the motivation section to support the entanglement diagnosis. However, we acknowledge that the abstract is concise and does not include these details. We will revise the abstract to briefly reference the quantitative evidence, improving clarity without changing the claims. revision: yes
Referee: The central claim that discrete primitives remain sufficiently expressive for complex, long-horizon camera motions (the weakest assumption noted in the stress-test) is load-bearing but receives no quantitative coverage analysis or ablation in the provided description of DisCoBench or the experimental protocol; without this, it is unclear whether the reported gains generalize beyond the tested scenarios.

Authors: We agree this is a valid concern and that the current manuscript lacks explicit quantitative coverage analysis or ablation for the expressiveness of discrete primitives in long-horizon scenarios. We will add such an analysis and ablation study to the DisCoBench and Experiments sections in the revision, including metrics on motion coverage to demonstrate sufficiency for the tested scenarios. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation begins from an empirical observation about feature similarity in continuous camera representations and proposes discrete action primitives as an alternative to improve separability. No equations, fitted parameters, or self-citations are presented that reduce this insight or the resulting model to a self-referential definition or construction. The central claim rests on an identified bottleneck and a proposed architectural change, with no load-bearing steps that equate predictions to inputs by construction. The argument remains self-contained without invoking uniqueness theorems, ansatzes smuggled via citation, or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Only abstract available; ledger populated from stated assumptions in the text.

axioms (1)

domain assumption: Discrete action primitives suffice to represent the range of camera motions needed for short-term, long-horizon, and dynamic exploration without expressiveness loss.
Invoked when proposing the method as a replacement for continuous trajectories.

invented entities (2)

DisCo model (no independent evidence)
purpose: Video world model conditioned on discrete action primitives
New architecture introduced to implement the discrete-control idea.
DisCoBench (no independent evidence)
purpose: Benchmark for short-term, long-horizon, and highly dynamic camera control evaluation
New evaluation suite released with the method.

pith-pipeline@v0.9.1-grok · 5683 in / 1182 out tokens · 19768 ms · 2026-06-27T20:19:35.208353+00:00 · methodology

Reference graph

Works this paper leans on

57 extracted references · 19 linked inside Pith

[1]

Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In CVPR, 2025

2025
[2]

Recammaster: Camera-controlled generative rendering from a single video

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025

arXiv 2025
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

Pith/arXiv arXiv 2025
[4]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In CVPR, 2025

2025
[5]

Stable video diffusion: Scaling latent video diffusion models to large datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

Pith/arXiv arXiv 2023
[6]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. NeurIPS, 2024

2024
[7]

Deepverse: 4d autoregressive video generation as a world model

Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. Deepverse: 4d autoregressive video generation as a world model. arXiv preprint arXiv:2506.01103, 2025

arXiv 2025
[8]

Self-forcing++: Towards minute-scale high-quality video generation

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025

Pith/arXiv arXiv 2025
[9]

Longvie 2: Multimodal controllable ultra-long video world model

Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, and Ziwei Liu. Longvie 2: Multimodal controllable ultra-long video world model. arXiv preprint arXiv:2512.13604, 2025

arXiv 2025
[10]

Vid-gpt: Introducing gpt-style autoregressive generation in video diffusion models

Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, and Jun Xiao. Vid-gpt: Introducing gpt-style autoregressive generation in video diffusion models. arXiv preprint arXiv:2406.10981, 2024

arXiv 2024
[11]

Long context tuning for video generation

Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. arXiv preprint arXiv:2503.10589, 2025

arXiv 2025
[12]

World models

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018

Pith/arXiv arXiv 2018
[13]

Dream to control: Learning behaviors by latent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019

Pith/arXiv arXiv 2019
[14]

Cameractrl: Enabling camera control for video diffusion models

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for video diffusion models. In ICLR, 2025

2025
[15]

Matrix-game 2.0: An open-source, real-time, and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

Pith/arXiv arXiv 2025
[16]

Vipe: Video pose engine for 3d geometric perception

Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025

Pith/arXiv arXiv 2025
[17]

Self forcing: Bridging the train-test gap in autoregressive video diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

Pith/arXiv arXiv 2025
[18]

Vbench++: Comprehensive and versatile benchmark suite for video generative models

Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[19]

World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation

Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation. arXiv preprint arXiv:2509.19080, 2025

arXiv 2025
[20]

Pyramidal flow matching for efficient video generative modeling

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

arXiv 2024
[21]

Adam: A method for stochastic optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

Pith/arXiv arXiv 2014
[22]

Pathdreamer: A world model for indoor navigation

Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In CVPR, 2021

2021
[23]

Videopoet: A large language model for zero-shot video generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

Pith/arXiv arXiv 2023
[24]

Hunyuanvideo: A systematic framework for large video generative models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[25]

Magicworld: Interactive geometry-driven video world exploration

Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration. arXiv preprint arXiv:2511.18886, 2025

arXiv 2025
[26]

Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition

Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2025

arXiv 2025
[27]

Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance

Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu. Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance. In ICCV, 2025

2025
[28]

Sekai: A video dataset towards world exploration

Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675, 2025

arXiv 2025
[29]

Arlon: Boosting diffusion transformers with autoregressive models for long video generation

Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation. arXiv preprint arXiv:2410.20502, 2024

arXiv 2024
[30]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In ECCV, 2024

2024
[31]

Rolling forcing: Autoregressive long video diffusion in real time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025

Pith/arXiv arXiv 2025
[32]

Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation

Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678, 2025

Pith/arXiv arXiv 2025
[33]

Yume: An interactive world generation model

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025

arXiv 2025
[34]

Gen3c: 3d-informed world-consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025

2025
[35]

Motionstream: Real-time video generation with interactive motion controls

Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266, 2025

arXiv 2025
[36]

Autoregressive model beats diffusion: Llama for scalable image generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

Pith/arXiv arXiv 2024
[37]

Worldplay: Towards long-term geometric consistency for real-time interactive world modeling

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

Pith/arXiv arXiv 2025
[38]

Towards accurate generative models of video: A new metric & challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

Pith/arXiv arXiv 2018
[39]

Streamllm: Enhancing constraint programming with large language model-generated streamliners

Florentina Voboril, Vaidyanathan Peruvemba Ramaswamy, and Stefan Szeider. Streamllm: Enhancing constraint programming with large language model-generated streamliners. In 2025 IEEE/ACM 1st International Workshop on Neuro-Symbolic Software Engineering (NSE), 2025

2025
[40]

Wan: Open and advanced large-scale video generative models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[41]

Omnitokenizer: A joint image-video tokenizer for visual generation

Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. In NeurIPS, 2024

2024
[42]

Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl

Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv preprint arXiv:2504.11455, 2025

arXiv 2025
[43]

Omnigen-ar: Autoregressive any-to-image generation

Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Omnigen-ar: Autoregressive any-to-image generation. In NeurIPS, 2025

2025
[44]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, 2024

2024
[45]

Videogpt: Video generation using vq-vae and transformers

Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

Pith/arXiv arXiv 2021
[46]

Longlive: Real-time interactive long video generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025

Pith/arXiv arXiv 2025
[47]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025

2025
[48]

Yan: Foundational interactive video generation

Deheng Ye, Fangyun Zhou, Jiacheng Lv, Jianqi Ma, Jun Zhang, Junyan Lv, Junyou Li, Minwen Deng, Mingyu Yang, Qiang Fu, et al. Yan: Foundational interactive video generation. arXiv preprint arXiv:2508.08601, 2025

arXiv 2025
[49]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. NeurIPS, 2024

2024
[50]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

2025
[51]

Context as memory: Scene-consistent interactive long video generation with memory retrieval

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

2025
[52]

Gamefactory: Creating new games with generative interactive videos

Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. arXiv preprint arXiv:2501.08325, 2025

arXiv 2025
[53]

Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning

David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In CVPR, 2025

2025
[54]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In CVPR, 2023

2023
[55]

Matrix-game: Interactive world foundation model

Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025

arXiv 2025
[56]

Omniworld: A multi-domain and multi-modal dataset for 4d world modeling

Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025

arXiv 2025
[57]

Astra: General interactive world model with autoregressive denoising

Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, and Jiwen Lu. Astra: General interactive world model with autoregressive denoising. arXiv preprint arXiv:2512.08931, 2025

Appendix A: More Experimental Details. All training stages are conducted on 16 NVIDIA H20 GPUs using the AdamW [21] optimizer. Teacher Model. The teacher m...

arXiv 2025

[1] [1]

Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. InCVPR, 2025

2025

[2] [2]

Recammaster: Camera-controlled generative rendering from a single video.arXiv preprint arXiv:2503.11647, 2025

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video.arXiv preprint arXiv:2503.11647, 2025

arXiv 2025

[3] [3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shĳie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXivpreprintarXiv:2502.13923, 2025

Pith/arXiv arXiv 2025

[4] [4]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. InCVPR, 2025

2025

[5] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

[6] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. NeurIPS, 2024.

[7] Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. Deepverse: 4d autoregressive video generation as a world model. arXiv preprint arXiv:2506.01103, 2025.

[8] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025.

[9] Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, and Ziwei Liu. Longvie 2: Multimodal controllable ultra-long video world model. arXiv preprint arXiv:2512.13604, 2025.

[10] Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, and Jun Xiao. Vid-gpt: Introducing gpt-style autoregressive generation in video diffusion models. arXiv preprint arXiv:2406.10981, 2024.

[11] Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. arXiv preprint arXiv:2503.10589, 2025.

[12] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018.

[13] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.

[14] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for video diffusion models. In ICLR, 2025.

[15] Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025.

[16] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025.

[17] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.

[18] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

[19] Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation. arXiv preprint arXiv:2509.19080, 2025.

[20] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024.

[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[22] Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In CVPR, 2021.

[23] Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.

[24] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[25] Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration. arXiv preprint arXiv:2511.18886, 2025.

[26] Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2025.

[27] Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu. Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance. In ICCV, 2025.

[28] Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675, 2025.

[29] Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation. arXiv preprint arXiv:2410.20502, 2024.

[30] Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In ECCV, 2024.

[31] Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025.

[32] Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678, 2025.

[33] Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025.

[34] Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025.

[35] Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266, 2025.

[36] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

[37] Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025.

[38] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

[39] Florentina Voboril, Vaidyanathan Peruvemba Ramaswamy, and Stefan Szeider. Streamllm: Enhancing constraint programming with large language model-generated streamliners. In 2025 IEEE/ACM 1st International Workshop on Neuro-Symbolic Software Engineering (NSE), 2025.

[40] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[41] Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. In NeurIPS, 2024.

[42] Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv preprint arXiv:2504.11455, 2025.

[43] Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Omnigen-ar: Autoregressive any-to-image generation. In NeurIPS, 2025.

[44] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, 2024.

[45] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.

[46] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025.

[47] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025.

[48] Deheng Ye, Fangyun Zhou, Jiacheng Lv, Jianqi Ma, Jun Zhang, Junyan Lv, Junyou Li, Minwen Deng, Mingyu Yang, Qiang Fu, et al. Yan: Foundational interactive video generation. arXiv preprint arXiv:2508.08601, 2025.

[49] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. NeurIPS, 2024.

[50] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025.

[51] Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025.

[52] Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. arXiv preprint arXiv:2501.08325, 2025.

[53] David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In CVPR, 2025.

[54] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In CVPR, 2023.

[55] Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025.

[56] Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025.

[57] Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, and Jiwen Lu. Astra: General interactive world model with autoregressive denoising. arXiv preprint arXiv:2512.08931, 2025.