arxiv: 2606.07967 · v1 · pith:EPOZIZZUnew · submitted 2026-06-06 · 💻 cs.CV

DisCo: World Models with Discrete Camera Motion Control

Pith reviewed 2026-06-27 20:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords controllable video generation · world models · discrete action primitives · camera motion control · action separability · video synthesis · DisCoBench

The pith

Discrete camera motion primitives replace continuous trajectories to improve action controllability in video world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most controllable video world models condition generation on continuous camera trajectories, but this produces high feature similarity across different motion patterns and weakens reliable action following. The paper identifies this entanglement as the core bottleneck and replaces it with conditioning on a compact set of discrete action primitives. The resulting model, DisCo, is evaluated on DisCoBench, which tests short-term, long-horizon, and highly dynamic camera control. Experiments show that the discrete representation yields more faithful execution of explicit commands while preserving visual quality and temporal coherence.

Core claim

Continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Conditioning generation instead on a compact set of discrete action primitives improves action separability, enabling significantly more reliable action following in controllable video generation while preserving visual quality.

What carries the argument

DisCo conditions video generation on a compact set of discrete action primitives chosen to increase separability between distinct camera motions.
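
The abstract does not list the primitives or say how they are chosen, so the following is only an illustrative sketch. It assumes a hypothetical seven-element set (`Primitive`) and made-up thresholds (`trans_eps`, `rot_eps`), and shows one way a continuous per-frame camera delta could be snapped to a single discrete command. None of these names or values come from the paper.

```python
from enum import Enum


class Primitive(Enum):
    """Hypothetical primitive set, chosen for illustration only."""
    STAY = 0
    FORWARD = 1
    BACKWARD = 2
    LEFT = 3
    RIGHT = 4
    YAW_LEFT = 5
    YAW_RIGHT = 6


def to_primitive(dx, dz, dyaw, trans_eps=0.05, rot_eps=2.0):
    """Map one per-frame camera delta to a single primitive.

    dx: sideways translation (+ is right), dz: forward translation
    (+ is forward), dyaw: yaw change in degrees (+ is left).
    trans_eps and rot_eps are assumed thresholds, not values from the paper.
    """
    # Divide each component by its threshold so that translation and
    # rotation become comparable, unitless scores.
    scores = {
        Primitive.FORWARD: dz / trans_eps,
        Primitive.BACKWARD: -dz / trans_eps,
        Primitive.RIGHT: dx / trans_eps,
        Primitive.LEFT: -dx / trans_eps,
        Primitive.YAW_LEFT: dyaw / rot_eps,
        Primitive.YAW_RIGHT: -dyaw / rot_eps,
    }
    best = max(scores, key=scores.get)
    # Below-threshold motion is treated as standing still.
    return best if scores[best] >= 1.0 else Primitive.STAY


def one_hot(p):
    """Encode a primitive as a one-hot conditioning vector."""
    vec = [0.0] * len(Primitive)
    vec[p.value] = 1.0
    return vec
```

One property worth noting: one-hot codes of two different primitives are orthogonal, so their cosine similarity is exactly zero before any learned embedding is applied. Whether DisCo uses one-hot codes, learned embeddings, or something else is not stated in the abstract.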

If this is right

  • Explicit action commands become more faithfully executed across complex and extended motion sequences.
  • Action separability improves without sacrificing visual quality or temporal coherence.
  • Standardized testing of controllability becomes possible via DisCoBench for short-term, long-horizon, and dynamic cases.
  • World models can support interactive exploration with explicit discrete controls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same discretization principle might reduce action-space complexity in other generative control tasks.
  • Hierarchical or learned primitives could become necessary if the fixed set proves insufficient for certain domains.
  • Training efficiency may increase because the model no longer needs to distinguish near-identical continuous trajectories.

Load-bearing premise

A small fixed set of discrete primitives can express the full range of complex, long-horizon camera motions required for realistic world exploration.

What would settle it

A demonstration that some realistic long camera sequence cannot be formed from the discrete primitives without measurable loss in visual fidelity or temporal coherence would falsify the central claim.

Original abstract

Controllable video world models target interactive world exploration, where models must faithfully execute explicit action commands while preserving visual quality and temporal coherence. However, most existing approaches rely on continuous camera trajectories as action conditions, which often lead to unreliable action following, especially under complex motion sequences. In this work, we identify action representation entanglement as a key bottleneck in controllable video generation, and show that continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Based on this insight, we propose DisCo, a controllable video world model that conditions generation on a compact set of discrete action primitives to improve action separability. We further introduce DisCoBench, a comprehensive benchmark for evaluating the ability of models in short-term, long-horizon, and highly dynamic exploration scenarios. Extensive experiments demonstrate that DisCo achieves significantly more reliable action following while preserving visual quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies action representation entanglement in continuous camera trajectories as a key bottleneck for controllable video world models, where distinct motion patterns exhibit high feature similarity that degrades action controllability. It proposes DisCo, a model conditioned on a compact set of discrete action primitives to improve separability, introduces the DisCoBench benchmark covering short-term, long-horizon, and highly dynamic scenarios, and reports that experiments demonstrate significantly more reliable action following while preserving visual quality and temporal coherence.

Significance. If the empirical results hold under rigorous controls, the work would provide a concrete demonstration that discrete primitives can mitigate entanglement issues in action-conditioned video generation, offering a practical alternative for interactive world models. The introduction of DisCoBench as a standardized evaluation suite for long-horizon camera control would also be a useful community resource, particularly if the benchmark protocols are reproducible and falsifiable.

major comments (2)
  1. [DisCoBench and Experiments sections] The central claim that discrete primitives remain sufficiently expressive for complex, long-horizon camera motions (the weakest assumption noted in the stress-test) is load-bearing but receives no quantitative coverage analysis or ablation in the provided description of DisCoBench or the experimental protocol; without this, it is unclear whether the reported gains generalize beyond the tested scenarios.
  2. [Introduction / Motivation] The identification of 'high feature similarity across distinct motion patterns' in continuous representations is asserted as the bottleneck, yet the abstract and summary provide no derivation, similarity metric, or quantitative comparison (e.g., cosine similarity or mutual information between features) to support the entanglement diagnosis before proposing the discrete fix.
minor comments (2)
  1. Notation for the discrete action primitives and how they are selected or learned should be clarified with a formal definition or pseudocode to aid reproducibility.
  2. [Abstract] The abstract states 'extensive experiments' but does not preview key metrics (e.g., action accuracy, FID, or temporal consistency scores) or baseline comparisons; adding one sentence would improve clarity.
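
The similarity comparison requested in major comment 2 could be sketched as below. This is a minimal sketch, assuming each motion pattern is summarized by one feature vector; the function names (`cosine`, `mean_cross_motion_similarity`) and the idea of averaging over all distinct pairs are assumptions made for illustration, not the paper's protocol.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_cross_motion_similarity(features):
    """Average cosine similarity over all pairs of distinct motion patterns.

    features: dict mapping a motion label to its feature vector.
    A value near 1 suggests the patterns are hard to tell apart;
    a value near 0 suggests they are well separated.
    """
    pairs = list(combinations(features.values(), 2))
    if not pairs:
        raise ValueError("need at least two motion patterns")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Running the same function on features extracted under continuous and under discrete conditioning would give the side-by-side number the referee asks for. With orthogonal one-hot inputs the score is exactly 0, which is the idealized upper bound on separability.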

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: The identification of 'high feature similarity across distinct motion patterns' in continuous representations is asserted as the bottleneck, yet the abstract and summary provide no derivation, similarity metric, or quantitative comparison (e.g., cosine similarity or mutual information between features) to support the entanglement diagnosis before proposing the discrete fix.

    Authors: The full manuscript provides a quantitative analysis of feature similarity (using cosine similarity) between distinct motion patterns under continuous representations in the motivation section to support the entanglement diagnosis. However, we acknowledge that the abstract is concise and does not include these details. We will revise the abstract to briefly reference the quantitative evidence, improving clarity without changing the claims.

    revision: yes

  2. Referee: The central claim that discrete primitives remain sufficiently expressive for complex, long-horizon camera motions (the weakest assumption noted in the stress-test) is load-bearing but receives no quantitative coverage analysis or ablation in the provided description of DisCoBench or the experimental protocol; without this, it is unclear whether the reported gains generalize beyond the tested scenarios.

    Authors: We agree this is a valid concern and that the current manuscript lacks explicit quantitative coverage analysis or ablation for the expressiveness of discrete primitives in long-horizon scenarios. We will add such an analysis and ablation study to the DisCoBench and Experiments sections in the revision, including metrics on motion coverage to demonstrate sufficiency for the tested scenarios.

    revision: yes
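
The rebuttal promises "metrics on motion coverage" without defining them. One hypothetical form such a metric could take is sketched here: for each step of a continuous trajectory, find the nearest primitive and record how far away it is. The function name `coverage`, the tolerance `tol`, and the assumption that all components are already normalized to comparable units are illustrative choices, not the authors' definition.

```python
import math


def coverage(deltas, primitives, tol):
    """Measure how well a primitive set approximates a trajectory.

    deltas: list of per-step motion tuples, assumed pre-normalized.
    primitives: dict mapping a primitive name to its canonical motion tuple.
    tol: maximum distance at which a step counts as covered.

    Returns (covered_fraction, mean_residual).
    """
    if not deltas:
        raise ValueError("trajectory is empty")
    if not primitives:
        raise ValueError("primitive set is empty")
    # Residual of a step = Euclidean distance to its nearest primitive.
    residuals = [
        min(math.dist(d, p) for p in primitives.values()) for d in deltas
    ]
    covered = sum(r <= tol for r in residuals)
    return covered / len(deltas), sum(residuals) / len(residuals)
```

A low covered fraction on realistic long-horizon trajectories would point to the expressiveness gap raised in the referee's comment; a high one would support the sufficiency claim only for the trajectories actually tested.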

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation begins from an empirical observation about feature similarity in continuous camera representations and proposes discrete action primitives as an alternative to improve separability. No equations, fitted parameters, or self-citations are presented that reduce this insight or the resulting model to a self-referential definition or construction. The central claim rests on an identified bottleneck and a proposed architectural change, with no load-bearing steps that equate predictions to inputs by construction. The argument remains self-contained without invoking uniqueness theorems, ansatzes smuggled via citation, or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Only abstract available; ledger populated from stated assumptions in the text.

axioms (1)
  • domain assumption: Discrete action primitives suffice to represent the range of camera motions needed for short-term, long-horizon, and dynamic exploration without expressiveness loss.
    Invoked when proposing the method as a replacement for continuous trajectories.
invented entities (2)
  • DisCo model (no independent evidence)
    purpose: Video world model conditioned on discrete action primitives
    New architecture introduced to implement the discrete-control idea.
  • DisCoBench (no independent evidence)
    purpose: Benchmark for short-term, long-horizon, and highly dynamic camera control evaluation
    New evaluation suite released with the method.

pith-pipeline@v0.9.1-grok · 5683 in / 1182 out tokens · 19768 ms · 2026-06-27T20:19:35.208353+00:00 · methodology

Reference graph

Works this paper leans on

57 extracted references · 19 linked inside Pith

  1. [1]

    Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

    Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In CVPR, 2025

  2. [2]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In CVPR, 2025

  5. [5]

    Stable video diffusion: Scaling latent video diffusion models to large datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  6. [6]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. NeurIPS, 2024

  7. [7]

    Deepverse: 4d autoregressive video generation as a world model

    Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. Deepverse: 4d autoregressive video generation as a world model. arXiv preprint arXiv:2506.01103, 2025

  8. [8]

    Self-forcing++: Towards minute-scale high-quality video generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025

  9. [9]

    Longvie 2: Multimodal controllable ultra-long video world model

    Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, and Ziwei Liu. Longvie 2: Multimodal controllable ultra-long video world model. arXiv preprint arXiv:2512.13604, 2025

  10. [10]

    Vid-gpt: Introducing gpt-style autoregressive generation in video diffusion models

    Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, and Jun Xiao. Vid-gpt: Introducing gpt-style autoregressive generation in video diffusion models. arXiv preprint arXiv:2406.10981, 2024

  11. [11]

    Long context tuning for video generation

    Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. arXiv preprint arXiv:2503.10589, 2025

  12. [12]

    World models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018

  13. [13]

    Dream to control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019

  14. [14]

    Cameractrl: Enabling camera control for video diffusion models

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for video diffusion models. In ICLR, 2025

  15. [15]

    Matrix-game 2.0: An open-source, real-time, and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

  16. [16]

    Vipe: Video pose engine for 3d geometric perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025

  17. [17]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

  18. [18]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  19. [19]

    World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation

    Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation. arXiv preprint arXiv:2509.19080, 2025

  20. [20]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

  21. [21]

    Adam: A method for stochastic optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  22. [22]

    Pathdreamer: A world model for indoor navigation

    Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In CVPR, 2021

  23. [23]

    Videopoet: A large language model for zero-shot video generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

  24. [24]

    Hunyuanvideo: A systematic framework for large video generative models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  25. [25]

    Magicworld: Interactive geometry-driven video world exploration

    Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration. arXiv preprint arXiv:2511.18886, 2025

  26. [26]

    Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition

    Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2025

  27. [27]

    Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance

    Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu. Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance. In ICCV, 2025

  28. [28]

    Sekai: A video dataset towards world exploration

    Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675, 2025

  29. [29]

    Arlon: Boosting diffusion transformers with autoregressive models for long video generation

    Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation. arXiv preprint arXiv:2410.20502, 2024

  30. [30]

    Evaluating text-to-visual generation with image-to-text generation

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In ECCV, 2024

  31. [31]

    Rolling forcing: Autoregressive long video diffusion in real time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025

  32. [32]

    Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation

    Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678, 2025

  33. [33]

    Yume: An interactive world generation model

    Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025

  34. [34]

    Gen3c: 3d-informed world-consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025

  35. [35]

    Motionstream: Real-time video generation with interactive motion controls

    Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266, 2025

  36. [36]

    Autoregressive model beats diffusion: Llama for scalable image generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

  37. [37]

    Worldplay: Towards long-term geometric consistency for real-time interactive world modeling

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

  38. [38]

    Towards accurate generative models of video: A new metric & challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  39. [39]

    Streamllm: Enhancing constraint programming with large language model-generated streamliners

    Florentina Voboril, Vaidyanathan Peruvemba Ramaswamy, and Stefan Szeider. Streamllm: Enhancing constraint programming with large language model-generated streamliners. In 2025 IEEE/ACM 1st International Workshop on Neuro-Symbolic Software Engineering (NSE), 2025

  40. [40]

    Wan: Open and advanced large-scale video generative models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  41. [41]

    Omnitokenizer: A joint image-video tokenizer for visual generation

    Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. In NeurIPS, 2024

  42. [42]

    Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl

    Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv preprint arXiv:2504.11455, 2025

  43. [43]

    Omnigen-ar: Autoregressive any-to-image generation

    Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Omnigen-ar: Autoregressive any-to-image generation. In NeurIPS, 2025

  44. [44]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, 2024

  45. [45]

    Videogpt: Video generation using vq-vae and transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

  46. [46]

    Longlive: Real-time interactive long video generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025

  47. [47]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025

  48. [48]

    Yan: Foundational interactive video generation

    Deheng Ye, Fangyun Zhou, Jiacheng Lv, Jianqi Ma, Jun Zhang, Junyan Lv, Junyou Li, Minwen Deng, Mingyu Yang, Qiang Fu, et al. Yan: Foundational interactive video generation. arXiv preprint arXiv:2508.08601, 2025

  49. [49]

    Improved distribution matching distillation for fast image synthesis

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. NeurIPS, 2024

  50. [50]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

  51. [51]

    Context as memory: Scene-consistent interactive long video generation with memory retrieval

    Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

  52. [52]

    Gamefactory: Creating new games with generative interactive videos

    Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. arXiv preprint arXiv:2501.08325, 2025

  53. [53]

    Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning

    David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In CVPR, 2025

  54. [54]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In CVPR, 2023

  55. [55]

    Matrix-game: Interactive world foundation model

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025

  56. [56]

    Omniworld: A multi-domain and multi-modal dataset for 4d world modeling

    Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025

  57. [57]

    Astra: General interactive world model with autoregressive denoising

    Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, and Jiwen Lu. Astra: General interactive world model with autoregressive denoising. arXiv preprint arXiv:2512.08931, 2025

Appendix A More Experimental details

All training stages are conducted on 16 NVIDIA H20 GPUs using the AdamW [21] optimizer. Teacher Model. The teacher m...