
arxiv: 2607.02517 · v1 · pith:3EMASAI6new · submitted 2026-07-02 · 💻 cs.CV

WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory

Pith reviewed 2026-07-03 14:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords world model, video generation, persistent memory, controllable simulation, LLM trajectory planning, dynamic objects, appearance stability, 3D control signals

The pith

WorldDirector decouples LLM-planned 3D trajectories from video rendering to preserve exact object identities after long absences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WorldDirector as a controllable video world model that separates semantic motion planning from pixel-level rendering. An LLM first generates 3D object trajectories and camera paths; these paths then act as explicit control signals fed to a video generator. The separation is intended to enforce physical consistency and keep the visual appearance of every dynamic object unchanged, even after the object has been out of view for many frames and later returns. A reader would care because prior world models tie motion directly to visual observation and therefore lose object identity when continuous viewing is interrupted.

Core claim

WorldDirector explicitly decouples semantic motion orchestration from visual generation by leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, thereby ensuring strict physical logic and appearance stability while preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view.

What carries the argument

Decoupling of semantic motion orchestration from visual generation, with LLM-generated 3D trajectories serving as control signals to a downstream video generator.
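
To make the decoupling concrete, here is a minimal sketch of what a planned trajectory could look like as plain data. The `ObjectTrack` class, its fields, and the linear interpolation are illustrative assumptions, not the paper's actual representation. The point it shows is that a planned object has a defined 3D position at every frame, whether or not the camera can see it.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ObjectTrack:
    """Hypothetical container for one planned object trajectory."""
    object_id: str
    # (frame index, world position) pairs with strictly increasing frame indices.
    keyframes: List[Tuple[int, Vec3]]

    def position_at(self, frame: int) -> Vec3:
        """Linearly interpolate the world position at any frame.

        The lookup never consults visibility, so the object's state
        keeps advancing while it is out of view.
        """
        if frame <= self.keyframes[0][0]:
            return self.keyframes[0][1]
        if frame >= self.keyframes[-1][0]:
            return self.keyframes[-1][1]
        for (f0, p0), (f1, p1) in zip(self.keyframes, self.keyframes[1:]):
            if f0 <= frame <= f1:
                t = (frame - f0) / (f1 - f0)
                return tuple(a + t * (b - a) for a, b in zip(p0, p1))
        raise ValueError("keyframes must be sorted by frame index")


# Illustrative plan: an object moving 10 units along x over 100 frames.
track = ObjectTrack("red_car", [(0, (0.0, 0.0, 5.0)), (100, (10.0, 0.0, 5.0))])
print(track.position_at(50))  # (5.0, 0.0, 5.0)
```

A planner that emits data of this kind can be queried for any frame, which is what lets a separate renderer pick the object up again after a long gap.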

If this is right

  • The method supports synthesis of complex and extended events.
  • It delivers controllability over both object motion and camera viewpoint.
  • It maintains persistent dynamic object memory across long absences from view.
  • It enables unrestricted viewpoint exploration in the generated world.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of planning from rendering could let users swap in different video generators while keeping the same trajectory planner.
  • Benchmarks that insert long temporal gaps between an object's appearances would directly test the claimed identity preservation.
  • The same trajectory-control pattern might be applied to non-video simulators such as physics engines if the control interface can be standardized.

Load-bearing premise

The LLM-generated 3D trajectories and camera paths can be converted into control signals that a downstream video generator follows without violating appearance consistency or physical plausibility.
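
The Figure 2 caption states that the 3D trajectories are projected into 2D location conditions. One way such a conversion could work is sketched below under a simple pinhole camera model. The function name, the single focal length, and the assumption that the point is already expressed in camera coordinates are all illustrative; the paper's actual projection is not described in the material available here.

```python
from typing import Optional, Tuple


def project_point(
    point_cam: Tuple[float, float, float],
    focal_px: float,
    width: int,
    height: int,
) -> Optional[Tuple[float, float]]:
    """Project a camera-space 3D point to pixel coordinates (pinhole model).

    Returns None when the point is behind the camera or falls outside the
    image, i.e. when the object is out of view at this time step.
    """
    x, y, z = point_cam
    if z <= 0:
        return None  # behind the camera plane
    u = focal_px * x / z + width / 2
    v = focal_px * y / z + height / 2
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None  # projected outside the frame


# A point on the optical axis lands at the image centre.
print(project_point((0.0, 0.0, 5.0), 500.0, 640, 480))   # (320.0, 240.0)
# A point far to the side is out of view, so no 2D condition is produced.
print(project_point((10.0, 0.0, 5.0), 500.0, 640, 480))  # None
```

Returning `None` for out-of-view frames keeps the 3D plan intact while the 2D condition is simply absent, which is the situation the premise has to survive.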

What would settle it

Create a test sequence in which a moving object exits the camera view for many frames and then re-enters; measure whether its rendered appearance on re-entry matches its earlier appearance under the same pixel-level identity metric used by the authors.
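
A test of this kind could be scored roughly as follows. This is a sketch under stated assumptions: the per-frame visibility flags and appearance vectors are hypothetical inputs, and cosine similarity stands in for whatever identity metric the authors actually use, which is not specified in the material available here.

```python
import math
from typing import Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def longest_absence(visible: Sequence[bool]) -> Optional[Tuple[int, int]]:
    """Find the longest run of invisible frames bounded by visible ones.

    Returns (last visible frame before the gap, first visible frame after),
    or None if the object never leaves and re-enters.
    """
    best, best_len, last_seen = None, 0, None
    for i, is_visible in enumerate(visible):
        if is_visible:
            if last_seen is not None and i - last_seen - 1 > best_len:
                best_len = i - last_seen - 1
                best = (last_seen, i)
            last_seen = i
    return best


def reentry_similarity(
    visible: Sequence[bool],
    appearance: Sequence[Sequence[float]],
) -> Optional[float]:
    """Compare appearance just before the longest absence and at re-entry."""
    gap = longest_absence(visible)
    if gap is None:
        return None
    before, after = gap
    return cosine(appearance[before], appearance[after])
```

A score near 1.0 across many sequences with long gaps would support the identity-preservation claim; a score that decays with gap length would count against it.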

Figures

Figures reproduced from arXiv: 2607.02517 by Hanlin Wang, Hao Ouyang, Ka Leong Cheng, Qifeng Chen, Qingyan Bai, Qiuyu Wang, Wen Wang, Yanhong Zeng, Yihao Meng, Yixuan Li, Yue Yu, Yujun Shen, Zichen Liu.

Figure 1: Controllable world simulation with persistent dynamic memory via WorldDirector. By decoupling 3D semantic orchestration from latent video synthesis, our framework autoregressively generates long-horizon videos via causal chunks, ensuring rigorous dynamic memory and object permanence. Please refer to the video results on our project page for intuitive demonstrations.
Figure 2: Overview of WorldDirector. An LLM orchestrates 3D trajectories that are projected into 2D Location Conditions for causal chunk generation. Location (B) and Appearance (A) conditions are channel-concatenated with the noisy latent, while historical Context (M) is sequence-concatenated. During generation, temporal drop is applied and an asymmetric attention routing prevents noise from polluting the context me…
Figure 3: Qualitative comparison with baselines. Note that HyDRA uses the initial 10s of our results as a reference video for its generation. Please refer to the video results on our project page for intuitive demonstrations.
Figure 4: Ablation on Appearance Condition. We conduct experiments on a case involving complex …
Figure 5: A generation example of Promptable World Events.
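
The Figure 2 caption mentions an asymmetric attention routing that keeps noise out of the context. The caption is truncated, so the exact rule is not visible here; the sketch below shows one plausible reading, in which context tokens attend only to other context tokens while noisy tokens attend to everything. The function name and the token ordering are assumptions made for illustration.

```python
from typing import List


def asymmetric_mask(num_context: int, num_noisy: int) -> List[List[bool]]:
    """Build a boolean attention mask over [context..., noisy...] tokens.

    mask[q][k] is True when query token q may attend to key token k.
    Context queries are blocked from noisy keys, so information flows
    from context to noisy tokens but not the other way round.
    """
    total = num_context + num_noisy
    mask = []
    for q in range(total):
        q_is_context = q < num_context
        row = []
        for k in range(total):
            k_is_noisy = k >= num_context
            row.append(not (q_is_context and k_is_noisy))
        mask.append(row)
    return mask


# Two context tokens followed by two noisy tokens.
for row in asymmetric_mask(2, 2):
    print(row)
```

With two tokens of each kind, the first two rows allow only the first two columns and the last two rows allow all four, which is the one-way flow described in the docstring.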
read the original abstract

We present WorldDirector, a highly controllable video world model framework designed for persistent dynamic object memory and unrestricted viewpoint exploration. Unlike existing world models that entangle physical dynamics with pixel rendering and rely on continuous visual observation to sustain motion, our framework explicitly decouples semantic motion orchestration from visual generation. By leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, our approach ensures strict physical logic and appearance stability, successfully preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view. Experimental results demonstrate that our method supports the synthesis of complex and extended events with unprecedented controllability and persistent dynamic object memory. Project Page: https://worlddirector.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents WorldDirector, a controllable video world model that decouples semantic motion orchestration (via LLM coordination of 3D trajectories and camera movements) from visual generation. These trajectories serve as control signals to a downstream video generator, with the explicit goal of enforcing strict physical logic, appearance stability, and persistent dynamic object memory so that exact visual identities of entities are preserved even after prolonged absences from the scene. The abstract reports that this enables synthesis of complex, extended events with high controllability.

Significance. If the claimed decoupling and control mechanism succeed in delivering pixel-level identity persistence and physical consistency without continuous observation, the work would address a core limitation in current video world models and generative simulators, with potential impact on applications requiring long-horizon scene consistency such as robotics planning and interactive media.

major comments (1)
  1. [Abstract] Abstract: the claim that the method 'ensures strict physical logic and appearance stability' and 'successfully preserv[es] the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view' is load-bearing for the central contribution, yet the abstract supplies no description of the video generator architecture, identity conditioning (e.g., embeddings, memory banks, or reference frames), or mechanism by which trajectory controls alone prevent deviation during long occlusions.
minor comments (1)
  1. No quantitative metrics, ablation studies, or implementation details are referenced in the provided abstract to allow assessment of the claimed improvements.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address the concern point-by-point below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the method 'ensures strict physical logic and appearance stability' and 'successfully preserv[es] the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view' is load-bearing for the central contribution, yet the abstract supplies no description of the video generator architecture, identity conditioning (e.g., embeddings, memory banks, or reference frames), or mechanism by which trajectory controls alone prevent deviation during long occlusions.

    Authors: We agree that the abstract, owing to length constraints, does not describe the video generator architecture or conditioning details. These are presented in Sections 3.2 and 4 of the manuscript: the LLM produces explicit 3D trajectories that serve as continuous control signals to the generator; the generator is conditioned directly on these trajectories (via positional embeddings and temporal alignment) rather than on visual memory or reference frames. Because the trajectories encode object identities and dynamics independently of pixel observations, identity is preserved by construction during long occlusions. We will revise the abstract to include one sentence referencing this trajectory-based conditioning mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: framework description contains no equations, fitted parameters, or self-referential derivations

full rationale

The paper presents a high-level architectural framework that decouples LLM-based trajectory orchestration from a downstream video generator. No mathematical derivations, parameter-fitting procedures, or prediction steps are described in the abstract or reader-provided text. The central claim of 'ensuring' appearance stability is stated as a consequence of the decoupling design rather than derived from any self-referential equation or fitted input. No self-citations, uniqueness theorems, or ansatzes are invoked in the supplied material. The derivation chain is therefore self-contained at the descriptive level with no reductions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5696 in / 919 out tokens · 23019 ms · 2026-07-03T14:26:10.841186+00:00 · methodology

