DisCo: World Models with Discrete Camera Motion Control
Pith reviewed 2026-06-27 20:19 UTC · model grok-4.3
The pith
Discrete camera motion primitives replace continuous trajectories to improve action controllability in video world models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Conditioning generation instead on a compact set of discrete action primitives improves action separability, enabling significantly more reliable action following in controllable video generation while preserving visual quality.
What carries the argument
DisCo conditions video generation on a compact set of discrete action primitives chosen to increase separability between distinct camera motions.
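As a rough illustration of what conditioning on discrete primitives could involve, the sketch below maps each continuous per-step camera delta to the best-aligned entry of a small primitive vocabulary. The six primitive names, the toy (forward, right, yaw) delta space, the `stay` label, and the `quantize_step` helper are all assumptions made for illustration; the source does not specify the actual primitive set or how it is chosen.

```python
import math

# Hypothetical primitive vocabulary, NOT taken from the paper. Each primitive is
# a unit direction in a toy (forward, right, yaw) delta space.
PRIMITIVES = {
    "forward": (1.0, 0.0, 0.0),
    "backward": (-1.0, 0.0, 0.0),
    "right": (0.0, 1.0, 0.0),
    "left": (0.0, -1.0, 0.0),
    "turn_right": (0.0, 0.0, 1.0),
    "turn_left": (0.0, 0.0, -1.0),
}
STAY = "stay"  # assumed no-op label for near-zero motion


def quantize_step(delta, min_norm=1e-3):
    """Map one continuous camera delta to the best-aligned primitive name."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm < min_norm:
        return STAY
    # Primitives are unit vectors, so the largest dot product is the best match.
    return max(
        PRIMITIVES,
        key=lambda name: sum(p * d for p, d in zip(PRIMITIVES[name], delta)),
    )


def quantize_trajectory(deltas):
    """Convert a sequence of continuous deltas into a list of primitive names."""
    return [quantize_step(d) for d in deltas]


actions = quantize_trajectory([(0.9, 0.1, 0.0), (0.0, -0.2, -0.5), (0.0, 0.0, 0.0)])
print(actions)  # ['forward', 'turn_left', 'stay']
```

A sequence of such labels is one plausible form for the discrete action condition; the actual model may use a different vocabulary or granularity.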
If this is right
- Explicit action commands become more faithfully executed across complex and extended motion sequences.
- Action separability improves without a trade-off in visual quality or temporal coherence.
- Standardized testing of controllability becomes possible via DisCoBench for short-term, long-horizon, and dynamic cases.
- World models can support interactive exploration with explicit discrete controls.
Where Pith is reading between the lines
- The same discretization principle might reduce action-space complexity in other generative control tasks.
- Hierarchical or learned primitives could become necessary if the fixed set proves insufficient for certain domains.
- Training efficiency may increase because the model no longer needs to distinguish near-identical continuous trajectories.
Load-bearing premise
A small fixed set of discrete primitives can express the full range of complex, long-horizon camera motions required for realistic world exploration.
What would settle it
A demonstration that some realistic long camera sequence cannot be formed from the discrete primitives without measurable loss in visual fidelity or temporal coherence would falsify the central claim.
Original abstract
Controllable video world models target interactive world exploration, where models must faithfully execute explicit action commands while preserving visual quality and temporal coherence. However, most existing approaches rely on continuous camera trajectories as action conditions, which often lead to unreliable action following, especially under complex motion sequences. In this work, we identify action representation entanglement as a key bottleneck in controllable video generation, and show that continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Based on this insight, we propose DisCo, a controllable video world model that conditions generation on a compact set of discrete action primitives to improve action separability. We further introduce DisCoBench, a comprehensive benchmark for evaluating the ability of models in short-term, long-horizon, and highly dynamic exploration scenarios. Extensive experiments demonstrate that DisCo achieves significantly more reliable action following while preserving visual quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies action representation entanglement in continuous camera trajectories as a key bottleneck for controllable video world models, where distinct motion patterns exhibit high feature similarity that degrades action controllability. It proposes DisCo, a model conditioned on a compact set of discrete action primitives to improve separability, introduces the DisCoBench benchmark covering short-term, long-horizon, and highly dynamic scenarios, and reports that experiments demonstrate significantly more reliable action following while preserving visual quality and temporal coherence.
Significance. If the empirical results hold under rigorous controls, the work would provide a concrete demonstration that discrete primitives can mitigate entanglement issues in action-conditioned video generation, offering a practical alternative for interactive world models. The introduction of DisCoBench as a standardized evaluation suite for long-horizon camera control would also be a useful community resource, particularly if the benchmark protocols are reproducible and falsifiable.
major comments (2)
- [DisCoBench and Experiments sections] The central claim that discrete primitives remain sufficiently expressive for complex, long-horizon camera motions (the weakest assumption noted in the stress-test) is load-bearing but receives no quantitative coverage analysis or ablation in the provided description of DisCoBench or the experimental protocol; without this, it is unclear whether the reported gains generalize beyond the tested scenarios.
- [Introduction / Motivation] The identification of 'high feature similarity across distinct motion patterns' in continuous representations is asserted as the bottleneck, yet the abstract and summary provide no derivation, similarity metric, or quantitative comparison (e.g., cosine similarity or mutual information between features) to support the entanglement diagnosis before proposing the discrete fix.
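The kind of quantitative check this comment asks for can be sketched in a few lines. The example below computes the mean pairwise cosine similarity between condition features of distinct motion patterns. The feature vectors and motion names are toy values invented for illustration, not measurements from the paper, and `mean_pairwise_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(features):
    """Average cosine similarity over all pairs of distinct motion patterns."""
    names = sorted(features)
    sims = [
        cosine_similarity(features[a], features[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    if not sims:
        raise ValueError("need at least two motion patterns")
    return sum(sims) / len(sims)


# Toy features (assumed): nearly parallel vectors stand in for an entangled
# representation, orthogonal vectors for a well-separated one.
entangled = {"pan_left": (1.0, 0.1), "pan_right": (1.0, -0.1)}
separated = {"pan_left": (1.0, 0.0), "pan_right": (0.0, 1.0)}

print(mean_pairwise_similarity(entangled))
print(mean_pairwise_similarity(separated))
```

Under this reading, a value near 1 for distinct motions would indicate the entanglement the paper describes, while a value near 0 would indicate separable conditions.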
minor comments (2)
- Notation for the discrete action primitives and how they are selected or learned should be clarified with a formal definition or pseudocode to aid reproducibility.
- [Abstract] The abstract states 'extensive experiments' but does not preview key metrics (e.g., action accuracy, FID, or temporal consistency scores) or baseline comparisons; adding one sentence would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate revisions where appropriate to strengthen the manuscript.
Point-by-point responses
- Referee: The identification of 'high feature similarity across distinct motion patterns' in continuous representations is asserted as the bottleneck, yet the abstract and summary provide no derivation, similarity metric, or quantitative comparison (e.g., cosine similarity or mutual information between features) to support the entanglement diagnosis before proposing the discrete fix.
  Authors: The full manuscript provides a quantitative analysis of feature similarity (using cosine similarity) between distinct motion patterns under continuous representations in the motivation section to support the entanglement diagnosis. However, we acknowledge that the abstract is concise and does not include these details. We will revise the abstract to briefly reference the quantitative evidence, improving clarity without changing the claims.
  Revision: yes
- Referee: The central claim that discrete primitives remain sufficiently expressive for complex, long-horizon camera motions (the weakest assumption noted in the stress-test) is load-bearing but receives no quantitative coverage analysis or ablation in the provided description of DisCoBench or the experimental protocol; without this, it is unclear whether the reported gains generalize beyond the tested scenarios.
  Authors: We agree this is a valid concern and that the current manuscript lacks explicit quantitative coverage analysis or ablation for the expressiveness of discrete primitives in long-horizon scenarios. We will add such an analysis and ablation study to the DisCoBench and Experiments sections in the revision, including metrics on motion coverage to demonstrate sufficiency for the tested scenarios.
  Revision: yes
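One possible shape for a motion-coverage metric of the kind promised here is sketched below: the fraction of moving steps in a continuous trajectory whose direction lies close to some primitive. The `coverage` function, the 0.9 cosine threshold, and the four-direction primitive set are assumptions for illustration only; the rebuttal does not say which metric the authors intend to report.

```python
import math


def best_alignment(delta, primitives):
    """Return the highest cosine between a non-zero delta and any primitive."""
    norm = math.sqrt(sum(d * d for d in delta))
    best = -1.0
    for prim in primitives:
        prim_norm = math.sqrt(sum(p * p for p in prim))
        dot = sum(p * d for p, d in zip(prim, delta))
        best = max(best, dot / (norm * prim_norm))
    return best


def coverage(deltas, primitives, min_cosine=0.9, eps=1e-9):
    """Fraction of moving steps whose direction is close to some primitive."""
    # Steps with (near-)zero motion carry no direction, so they are skipped.
    moving = [d for d in deltas if math.sqrt(sum(x * x for x in d)) > eps]
    if not moving:
        return 1.0  # nothing to cover, treated as fully covered
    covered = sum(1 for d in moving if best_alignment(d, primitives) >= min_cosine)
    return covered / len(moving)


# Hypothetical 2-D primitive directions and a toy trajectory.
primitives = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
deltas = [(1.0, 0.0), (1.0, 1.0), (0.0, 2.0), (0.0, 0.0)]
score = coverage(deltas, primitives)
print(score)  # two of the three moving steps are covered
```

This per-step measure is deliberately conservative: it does not credit a diagonal motion that could be approximated by alternating two primitives over time, so a sequence-level variant would be needed to assess long-horizon expressiveness.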
Circularity Check
No significant circularity detected
Full rationale
The paper's derivation begins from an empirical observation about feature similarity in continuous camera representations and proposes discrete action primitives as an alternative to improve separability. No equations, fitted parameters, or self-citations are presented that reduce this insight or the resulting model to a self-referential definition or construction. The central claim rests on an identified bottleneck and a proposed architectural change, with no load-bearing steps that equate predictions to inputs by construction. The argument remains self-contained without invoking uniqueness theorems, ansatzes smuggled via citation, or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Discrete action primitives suffice to represent the range of camera motions needed for short-term, long-horizon, and dynamic exploration without expressiveness loss.
invented entities (2)
- DisCo model: no independent evidence
- DisCoBench: no independent evidence
Reference graph
Works this paper leans on
- [1] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In CVPR, 2025
- [2] Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025
- [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025
- [4] Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In CVPR, 2025
- [5] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023
- [6] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. NeurIPS, 2024
- [7] Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. Deepverse: 4d autoregressive video generation as a world model. arXiv preprint arXiv:2506.01103, 2025
- [8] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025
- [9] Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, and Ziwei Liu. Longvie 2: Multimodal controllable ultra-long video world model. arXiv preprint arXiv:2512.13604, 2025
- [10] Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, and Jun Xiao. Vid-gpt: Introducing gpt-style autoregressive generation in video diffusion models. arXiv preprint arXiv:2406.10981, 2024
- [11] Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. arXiv preprint arXiv:2503.10589, 2025
- [12] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018
- [13] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019
- [14] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for video diffusion models. In ICLR, 2025
- [15] Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025
- [16] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025
- [17] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025
- [18] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
- [19] Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation. arXiv preprint arXiv:2509.19080, 2025
- [20] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024
- [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
- [22] Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In CVPR, 2021
- [23] Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023
- [24] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024
- [25] Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration. arXiv preprint arXiv:2511.18886, 2025
- [26] Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2025
- [27] Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu. Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance. In ICCV, 2025
- [28] Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675, 2025
- [29] Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation. arXiv preprint arXiv:2410.20502, 2024
- [30] Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In ECCV, 2024
- [31] Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025
- [32] Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678, 2025
- [33] Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025
- [34] Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025
- [35] Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266, 2025
- [36] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024
- [37] Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025
- [38] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018
- [39] Florentina Voboril, Vaidyanathan Peruvemba Ramaswamy, and Stefan Szeider. Streamllm: Enhancing constraint programming with large language model-generated streamliners. In 2025 IEEE/ACM 1st International Workshop on Neuro-Symbolic Software Engineering (NSE), 2025
- [40] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
- [41] Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. In NeurIPS, 2024
- [42] Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv preprint arXiv:2504.11455, 2025
- [43] Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Omnigen-ar: Autoregressive any-to-image generation. In NeurIPS, 2025
- [44] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, 2024
- [45] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021
- [46] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025
- [47] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025
- [48] Deheng Ye, Fangyun Zhou, Jiacheng Lv, Jianqi Ma, Jun Zhang, Junyan Lv, Junyou Li, Minwen Deng, Mingyu Yang, Qiang Fu, et al. Yan: Foundational interactive video generation. arXiv preprint arXiv:2508.08601, 2025
- [49] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. NeurIPS, 2024
- [50] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025
- [51] Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025
- [52] Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. arXiv preprint arXiv:2501.08325, 2025
- [53] David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In CVPR, 2025
- [54] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In CVPR, 2023
- [55] Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025
- [56] Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025
- [57] Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, and Jiwen Lu. Astra: General interactive world model with autoregressive denoising. arXiv preprint arXiv:2512.08931, 2025
Appendix A: More Experimental Details
All training stages are conducted on 16 NVIDIA H20 GPUs using the AdamW [21] optimizer.
Teacher Model. The teacher m...