WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory

Hanlin Wang; Hao Ouyang; Ka Leong Cheng; Qifeng Chen; Qingyan Bai; Qiuyu Wang; Wen Wang; Yanhong Zeng; Yihao Meng; Yixuan Li

arxiv: 2607.02517 · v1 · pith:3EMASAI6new · submitted 2026-07-02 · 💻 cs.CV

WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory

Hanlin Wang , Hao Ouyang , Qiuyu Wang , Wen Wang , Qingyan Bai , Ka Leong Cheng , Yue Yu , Yixuan Li

show 5 more authors

Yihao Meng Zichen Liu Yanhong Zeng Yujun Shen Qifeng Chen

This is my paper

Pith reviewed 2026-07-03 14:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords world modelvideo generationpersistent memorycontrollable simulationLLM trajectory planningdynamic objectsappearance stability3D control signals

0 comments

The pith

WorldDirector decouples LLM-planned 3D trajectories from video rendering to preserve exact object identities after long absences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WorldDirector as a controllable video world model that separates semantic motion planning from pixel-level rendering. An LLM first generates 3D object trajectories and camera paths; these paths then act as explicit control signals fed to a video generator. The separation is intended to enforce physical consistency and keep the visual appearance of every dynamic object unchanged, even after the object leaves the frame for many frames and later returns. A reader would care because prior world models tie motion directly to visual observation and therefore lose object identity when continuous viewing is interrupted.

Core claim

WorldDirector explicitly decouples semantic motion orchestration from visual generation by leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, thereby ensuring strict physical logic and appearance stability while preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view.

What carries the argument

Decoupling of semantic motion orchestration from visual generation, with LLM-generated 3D trajectories serving as control signals to a downstream video generator.

If this is right

The method supports synthesis of complex and extended events.
It delivers controllability over both object motion and camera viewpoint.
It maintains persistent dynamic object memory across long absences from view.
It enables unrestricted viewpoint exploration in the generated world.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of planning from rendering could let users swap in different video generators while keeping the same trajectory planner.
Benchmarks that insert long temporal gaps between an object's appearances would directly test the claimed identity preservation.
The same trajectory-control pattern might be applied to non-video simulators such as physics engines if the control interface can be standardized.

Load-bearing premise

The LLM-generated 3D trajectories and camera paths can be converted into control signals that a downstream video generator follows without violating appearance consistency or physical plausibility.

What would settle it

Create a test sequence in which a moving object exits the camera view for many frames then re-enters; measure whether its rendered appearance matches its earlier appearance to within the same pixel-level identity metric used by the authors.

Figures

Figures reproduced from arXiv: 2607.02517 by Hanlin Wang, Hao Ouyang, Ka Leong Cheng, Qifeng Chen, Qingyan Bai, Qiuyu Wang, Wen Wang, Yanhong Zeng, Yihao Meng, Yixuan Li, Yue Yu, Yujun Shen, Zichen Liu.

**Figure 1.** Figure 1: Controllable world simulation with persistent dynamic memory via WorldDirector. By decoupling 3D semantic orchestration from latent video synthesis, our framework autoregressively generates long-horizon videos via causal chunks, ensuring rigorous dynamic memory and object permanence. Please refer to the video results on our project page for intuitive demonstrations. Abstract We present WorldDirector, a hig… view at source ↗

**Figure 2.** Figure 2: Overview of WorldDirector. An LLM orchestrates 3D trajectories that are projected into 2D Location Conditions for causal chunk generation. Location (B) and Appearance (A) conditions are channel-concatenated with the noisy latent, while historical Context (M) is sequence-concatenated. During generation, temporal drop is applied and an asymmetric attention routing prevents noise from polluting the context me… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison with baselines. Note that HyDRA uses the initial 10s of our results as a reference video for its generation. Please refer to the video results on our project page for intuitive demonstrations [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Ablation on Appearance Condition. We conduct experiments on a case involving complex [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: A generation example of Promptable World Events. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

We present WorldDirector, a highly controllable video world model framework designed for persistent dynamic object memory and unrestricted viewpoint exploration. Unlike existing world models that entangle physical dynamics with pixel rendering and rely on continuous visual observation to sustain motion, our framework explicitly decouples semantic motion orchestration from visual generation. By leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, our approach ensures strict physical logic and appearance stability, successfully preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view. Experimental results demonstrate that our method supports the synthesis of complex and extended events with unprecedented controllability and persistent dynamic object memory. Project Page: https://worlddirector.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The decoupling of LLM trajectory planning from video generation for persistent object memory is a clean split worth testing, but the abstract's guarantee of exact identity preservation after long absences lacks any supporting architecture or evidence details.

read the letter

WorldDirector splits motion planning to an LLM that outputs 3D trajectories and camera paths, then uses those as controls for a video generator. This is positioned as a way to get persistent memory for objects that come and go from view, unlike models that keep everything entangled in pixel space.

The separation itself is the main novelty here. It targets the practical issue of maintaining object identity over long sequences without constant visual input, which matters for building simulators that can handle extended events.

That said, the abstract's assertion that the method 'ensures strict physical logic and appearance stability' and preserves exact identities after long absences stands out as the part that needs checking. There's no description of how the video generator actually remembers or conditions on object appearance, and no mention of any ablations or metrics. The stress-test concern about trajectory controls not being sufficient for pixel-level consistency in standard generators seems to apply directly from the abstract alone. Without seeing the implementation details or results in the full paper, it's difficult to know if they have a mechanism that actually delivers on the guarantee.

This is for people building controllable video simulators, especially those working on robotics or data generation pipelines. A reader looking for a new framework to experiment with could get something out of it.

I would send this to peer review. The core idea is worth a closer look even if the persistence claims require more support.

Referee Report

1 major / 1 minor

Summary. The paper presents WorldDirector, a controllable video world model that decouples semantic motion orchestration (via LLM coordination of 3D trajectories and camera movements) from visual generation. These trajectories serve as control signals to a downstream video generator, with the explicit goal of enforcing strict physical logic, appearance stability, and persistent dynamic object memory so that exact visual identities of entities are preserved even after prolonged absences from the scene. The abstract reports that this enables synthesis of complex, extended events with high controllability.

Significance. If the claimed decoupling and control mechanism succeed in delivering pixel-level identity persistence and physical consistency without continuous observation, the work would address a core limitation in current video world models and generative simulators, with potential impact on applications requiring long-horizon scene consistency such as robotics planning and interactive media.

major comments (1)

[Abstract] Abstract: the claim that the method 'ensures strict physical logic and appearance stability' and 'successfully preserv[es] the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view' is load-bearing for the central contribution, yet the abstract supplies no description of the video generator architecture, identity conditioning (e.g., embeddings, memory banks, or reference frames), or mechanism by which trajectory controls alone prevent deviation during long occlusions.

minor comments (1)

No quantitative metrics, ablation studies, or implementation details are referenced in the provided abstract to allow assessment of the claimed improvements.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address the concern point-by-point below and indicate where revisions will be made.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the method 'ensures strict physical logic and appearance stability' and 'successfully preserv[es] the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view' is load-bearing for the central contribution, yet the abstract supplies no description of the video generator architecture, identity conditioning (e.g., embeddings, memory banks, or reference frames), or mechanism by which trajectory controls alone prevent deviation during long occlusions.

Authors: We agree that the abstract, owing to length constraints, does not describe the video generator architecture or conditioning details. These are presented in Sections 3.2 and 4 of the manuscript: the LLM produces explicit 3D trajectories that serve as continuous control signals to the generator; the generator is conditioned directly on these trajectories (via positional embeddings and temporal alignment) rather than on visual memory or reference frames. Because the trajectories encode object identities and dynamics independently of pixel observations, identity is preserved by construction during long occlusions. We will revise the abstract to include one sentence referencing this trajectory-based conditioning mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: framework description contains no equations, fitted parameters, or self-referential derivations

full rationale

The paper presents a high-level architectural framework that decouples LLM-based trajectory orchestration from a downstream video generator. No mathematical derivations, parameter-fitting procedures, or prediction steps are described in the abstract or reader-provided text. The central claim of 'ensuring' appearance stability is stated as a consequence of the decoupling design rather than derived from any self-referential equation or fitted input. No self-citations, uniqueness theorems, or ansatzes are invoked in the supplied material. The derivation chain is therefore self-contained at the descriptive level with no reductions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5696 in / 919 out tokens · 23019 ms · 2026-07-03T14:26:10.841186+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 22 canonical work pages · 10 internal anchors

[1]

Diffusion for world modeling: Visual details matter in atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[2]

Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

work page arXiv 2025
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Phyre: A new benchmark for physical reasoning

Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. Phyre: A new benchmark for physical reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2019

2019
[5]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, 2024

2024
[6]

Revisiting feature prediction for learning visual representations from video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. Transactions on Machine Learning Research (TMLR), 2024

2024
[7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/ video-generation-models-as-world-simulators

2024
[9]

Genie: Generative interactive environments

Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InInternational Conference on Machine Learning (ICML), 2024

2024
[10]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Gamegen-x: Interactive open-world game video generation

Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. InInternational Conference on Learning Representations (ICLR), 2025

2025
[12]

Out of sight but not out of mind: Hybrid memory for dynamic video world models.arXiv preprint arXiv:2603.25716, 2026

Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, and Xiang Bai. Out of sight but not out of mind: Hybrid memory for dynamic video world models.arXiv preprint arXiv:2603.25716, 2026

work page arXiv 2026
[13]

Seine: Short-to-long video diffusion model for generative transition and prediction

Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. InInternational Conference on Learning Representations (ICLR), 2024

2024
[14]

Wan-Move: Motion-controllable video generation via latent trajectory guidance

Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, and Yujiu Yang. Wan-Move: Motion-controllable video generation via latent trajectory guidance. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[15]

Oasis: A universe in a transformer

Decart, Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer. Project page, 2024. URLhttps://oasis-model.github.io/

2024
[16]

Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, and Lingqiao Liu. Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

work page arXiv 2026
[17]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InInternational Conference on Machine Learning (ICML), 2024. 10

2024
[18]

Motion prompting: Controlling video generation with motion trajectories

Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez- Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, and Deqing Sun. Motion prompting: Controlling video generation with motion trajectories. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern R...

2025
[19]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Streamingt2v: Consistent, dynamic, and extendable long video generation from text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[22]

Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040,

Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

work page arXiv 2025
[23]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[24]

Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency.arXiv preprint, 2025

Team HunyuanWorld. Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency.arXiv preprint, 2025

2025
[25]

Vace: All-in-one video creation and editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[26]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision (ICCV), 2023

2023
[27]

Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Joshua V . Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig ...

2024
[28]

MagicMotion: Controllable video generation with dense-to-sparse trajectory guidance

Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu. MagicMotion: Controllable video generation with dense-to-sparse trajectory guidance. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[29]

Vmem: Consistent interactive video scene generation with surfel-indexed view memory

Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[30]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023
[31]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

2023
[32]

Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

work page arXiv 2025
[33]

Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744,

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025. 11

work page arXiv 2025
[34]

T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI conference on artificial intelligence (AAAI), 2024

2024
[35]

Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

work page arXiv 2023
[36]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Advancing Open-source World Models

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Levitor: 3d trajectory oriented image-to-video synthesis

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, and Limin Wang. Levitor: 3d trajectory oriented image-to-video synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025

2025
[41]

The world is your canvas: Painting promptable events with reference images, trajectories, and text.arXiv preprint arXiv:2512.16924, 2025

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, et al. The world is your canvas: Painting promptable events with reference images, trajectories, and text.arXiv preprint arXiv:2512.16924, 2025

work page arXiv 2025
[42]

arXiv preprint arXiv:2402.01566 (2024),https://arxiv.org/abs/2402.015664

Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, and Hang Li. Boximator: Generating rich and controllable motions for video synthesis.arXiv preprint arXiv:2402.01566, 2024

work page arXiv 2024
[43]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, 2024

2024
[44]

Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory.arXiv preprint arXiv:2602.02393, 2026

Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory.arXiv preprint arXiv:2602.02393, 2026

work page arXiv 2026
[45]

DragAnything: Motion control for anything using entity representation

Wejia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. DragAnything: Motion control for anything using entity representation. In European Conference on Computer Vision (ECCV), 2024

2024
[46]

Worldmem: Long-term consistent world simulation with memory

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[47]

Depth anything v2.Advances in Neural Information Processing Systems (NeurIPS), 2024

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2.Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[48]

Tenen- baum

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenen- baum. Clevrer: Collision events for video representation and reasoning. InInternational Conference on Learning Representations (ICLR), 2020

2020
[49]

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Drag- nuwa: Fine-grained control in video generation by integrating text, image, and trajectory.arXiv preprint arXiv:2308.08089, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Context as memory: Scene-consistent interactive long video generation with memory retrieval.arXiv preprint arXiv:2506.03141, 2025

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval.arXiv preprint arXiv:2506.03141, 2025

work page arXiv 2025
[51]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 12

2023
[52]

Controlvideo: Training-free controllable text-to-video generation.arXiv preprint arXiv:2305.13077, 2023

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation.arXiv preprint arXiv:2305.13077, 2023

work page arXiv 2023
[53]

coordinate_system

Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 13 A Training and Compute Details. In this section, we provide a more comprehensive ...

2025

[1] [1]

Diffusion for world modeling: Visual details matter in atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[2] [2]

Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

work page arXiv 2025

[3] [3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Phyre: A new benchmark for physical reasoning

Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. Phyre: A new benchmark for physical reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2019

2019

[5] [5]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, 2024

2024

[6] [6]

Revisiting feature prediction for learning visual representations from video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. Transactions on Machine Learning Research (TMLR), 2024

2024

[7] [7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/ video-generation-models-as-world-simulators

2024

[9] [9]

Genie: Generative interactive environments

Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InInternational Conference on Machine Learning (ICML), 2024

2024

[10] [10]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Gamegen-x: Interactive open-world game video generation

Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. InInternational Conference on Learning Representations (ICLR), 2025

2025

[12] [12]

Out of sight but not out of mind: Hybrid memory for dynamic video world models.arXiv preprint arXiv:2603.25716, 2026

Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, and Xiang Bai. Out of sight but not out of mind: Hybrid memory for dynamic video world models.arXiv preprint arXiv:2603.25716, 2026

work page arXiv 2026

[13] [13]

Seine: Short-to-long video diffusion model for generative transition and prediction

Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. InInternational Conference on Learning Representations (ICLR), 2024

2024

[14] [14]

Wan-Move: Motion-controllable video generation via latent trajectory guidance

Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, and Yujiu Yang. Wan-Move: Motion-controllable video generation via latent trajectory guidance. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[15] [15]

Oasis: A universe in a transformer

Decart, Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer. Project page, 2024. URLhttps://oasis-model.github.io/

2024

[16] [16]

Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, and Lingqiao Liu. Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

work page arXiv 2026

[17] [17]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InInternational Conference on Machine Learning (ICML), 2024. 10

2024

[18] [18]

Motion prompting: Controlling video generation with motion trajectories

Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez- Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, and Deqing Sun. Motion prompting: Controlling video generation with motion trajectories. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern R...

2025

[19] [19]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, et al. Matrix-game 2.0: An open-source, real-time, and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Streamingt2v: Consistent, dynamic, and extendable long video generation from text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025

[22] [22]

Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040,

Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

work page arXiv 2025

[23] [23]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[24] [24]

Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency.arXiv preprint, 2025

Team HunyuanWorld. Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency.arXiv preprint, 2025

2025

[25] [25]

Vace: All-in-one video creation and editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[26] [26]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision (ICCV), 2023

2023

[27] [27]

Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Joshua V . Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig ...

2024

[28] [28]

MagicMotion: Controllable video generation with dense-to-sparse trajectory guidance

Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu. MagicMotion: Controllable video generation with dense-to-sparse trajectory guidance. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[29] [29]

Vmem: Consistent interactive video scene generation with surfel-indexed view memory

Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[30] [30]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023

[31] [31]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

2023

[32] [32]

Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

work page arXiv 2025

[33] [33]

Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744,

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025. 11

work page arXiv 2025

[34] [34]

T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI conference on artificial intelligence (AAAI), 2024

2024

[35] [35]

Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

work page arXiv 2023

[36] [36]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [38]

Advancing Open-source World Models

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

Levitor: 3d trajectory oriented image-to-video synthesis

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, and Limin Wang. Levitor: 3d trajectory oriented image-to-video synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025

2025

[41] [41]

The world is your canvas: Painting promptable events with reference images, trajectories, and text.arXiv preprint arXiv:2512.16924, 2025

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, et al. The world is your canvas: Painting promptable events with reference images, trajectories, and text.arXiv preprint arXiv:2512.16924, 2025

work page arXiv 2025

[42] [42]

arXiv preprint arXiv:2402.01566 (2024),https://arxiv.org/abs/2402.015664

Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, and Hang Li. Boximator: Generating rich and controllable motions for video synthesis.arXiv preprint arXiv:2402.01566, 2024

work page arXiv 2024

[43] [43]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, 2024

2024

[44] [44]

Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory.arXiv preprint arXiv:2602.02393, 2026

Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory.arXiv preprint arXiv:2602.02393, 2026

work page arXiv 2026

[45] [45]

DragAnything: Motion control for anything using entity representation

Wejia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. DragAnything: Motion control for anything using entity representation. In European Conference on Computer Vision (ECCV), 2024

2024

[46] [46]

Worldmem: Long-term consistent world simulation with memory

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[47] [47]

Depth anything v2.Advances in Neural Information Processing Systems (NeurIPS), 2024

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2.Advances in Neural Information Processing Systems (NeurIPS), 2024

2024

[48] [48]

Tenen- baum

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenen- baum. Clevrer: Collision events for video representation and reasoning. InInternational Conference on Learning Representations (ICLR), 2020

2020

[49] [49]

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Drag- nuwa: Fine-grained control in video generation by integrating text, image, and trajectory.arXiv preprint arXiv:2308.08089, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [50]

Context as memory: Scene-consistent interactive long video generation with memory retrieval.arXiv preprint arXiv:2506.03141, 2025

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval.arXiv preprint arXiv:2506.03141, 2025

work page arXiv 2025

[51] [51]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. 12

2023

[52] [52]

Controlvideo: Training-free controllable text-to-video generation.arXiv preprint arXiv:2305.13077, 2023

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation.arXiv preprint arXiv:2305.13077, 2023

work page arXiv 2023

[53] [53]

coordinate_system

Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 13 A Training and Compute Details. In this section, we provide a more comprehensive ...

2025