WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

Alan Zhao; Bohai Gu; Dazhao Du; Jian Liu; Jie Zhang; Jinxiang Lai; Shuai Yang; Song Guo; Taiyi Wu; Xiaocheng Lu

arxiv: 2605.25077 · v1 · pith:QNA7BFDHnew · submitted 2026-05-24 · 💻 cs.CV

Bohai Gu, Taiyi Wu, Yueyang Yuan, Jian Liu, Xiaocheng Lu, Dazhao Du, Jie Zhang, Jinxiang Lai, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo

Pith reviewed 2026-06-30 12:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video world models · object manipulation · camera navigation · trajectory control · LoRA · state persistence · autoregressive generation · interactive environments

The pith

WorldCraft adds object manipulation to video world models by injecting camera-invariant trajectories through a spatial pathway while keeping camera navigation intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that video world models, currently limited to camera navigation, can be extended to let users select and move individual objects along sketched paths. It does this with three linked pieces: Normalized World Trajectory converts user input into a world-coordinate motion that is re-projected under the current camera view, Spatial-Pathway LoRA adds the object signal through an existing spatial route without touching the camera controller, and Trajectory-Anchored State Persistence keeps the moved object's location in memory so it reappears correctly after leaving the frame. A reader would care because real-world interaction is object-centric; without this, the models stay passive viewers rather than environments one can act inside. If the method works, the same pretrained models can support both viewpoint changes and object actions in one forward pass.
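To make the camera-invariance idea concrete, here is a minimal sketch of re-projecting one fixed world-space trajectory under two different camera motions. It is not the paper's NWT implementation: the pinhole model, the yaw-only camera, the function names `project` and `reproject`, and all numbers are assumptions chosen for illustration, and the normalization step is left out because the source does not specify it.

```python
import math

def project(point_w, cam_pos, yaw, focal=1.0):
    """Pinhole-project a world point (x, y, z) for a camera at cam_pos that is
    rotated by `yaw` radians about the vertical (y) axis. Returns normalized
    image coordinates (u, v), or None if the point is behind the camera."""
    dx = point_w[0] - cam_pos[0]
    dy = point_w[1] - cam_pos[1]
    dz = point_w[2] - cam_pos[2]
    c, s = math.cos(yaw), math.sin(yaw)
    x_cam = c * dx - s * dz   # component along the camera's right axis
    z_cam = s * dx + c * dz   # component along the camera's viewing axis
    if z_cam <= 0.0:
        return None
    return (focal * x_cam / z_cam, focal * dy / z_cam)

def reproject(traj_w, cam_poses):
    """Re-project one fixed world trajectory under a per-frame camera pose."""
    return [project(p, pos, yaw) for p, (pos, yaw) in zip(traj_w, cam_poses)]

# Hypothetical data: the object moves 1 unit to the right at depth 5.
traj_w = [(0.0, 0.0, 5.0), (0.5, 0.0, 5.0), (1.0, 0.0, 5.0)]
origin = (0.0, 0.0, 0.0)
static_cam = [(origin, 0.0), (origin, 0.0), (origin, 0.0)]
panning_cam = [(origin, 0.0), (origin, 0.1), (origin, 0.2)]  # camera turns right

static_uv = reproject(traj_w, static_cam)
panning_uv = reproject(traj_w, panning_cam)
```

With the static camera the object drifts right on screen; with the panning camera the same world motion stays close to the image centre. The world trajectory is identical in both cases, which is the separation of object motion from camera-induced screen-space displacement that the summary describes.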

Core claim

WorldCraft demonstrates that a pretrained video world model can be given object-level trajectory control by representing user paths in a camera-invariant world frame via Normalized World Trajectory, routing that signal through Spatial-Pathway LoRA into the model's spatial pathway, and anchoring the resulting state with Trajectory-Anchored State Persistence so that moved objects remain consistent across autoregressive steps even when they leave the camera view.

What carries the argument

The trajectory-centric control pipeline: Normalized World Trajectory (NWT) for camera-invariant motion representation, Spatial-Pathway LoRA (SP-LoRA) for injecting the object signal, and Trajectory-Anchored State Persistence (TASP) for maintaining updated object positions in autoregressive memory.

If this is right

Users can draw an object path and receive video in which that object follows the path while the camera continues to move independently.
Camera navigation performance on the original tasks remains unchanged after the object-control addition.
Objects retain their updated world positions across long sequences that include periods when they are off-screen.
The same model handles both camera and object actions without separate fine-tuning branches.
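The off-screen consequence listed above can be illustrated with a toy one-dimensional world. This is a sketch under loud assumptions, not the paper's TASP mechanism: the real method refreshes the autoregressive memory of a video model, whereas the hypothetical `rollout` function below only tracks a single coordinate and compares a state that follows the commanded trajectory at every step with one that updates only while the object is in view.

```python
def rollout(start_x, deltas, cam_centers, half_width=2.0, persist=True):
    """Toy 1D world. The camera sees [cam_x - half_width, cam_x + half_width].
    With persist=True the stored object position follows the commanded
    displacement at every step; with persist=False it only advances while the
    stored position is inside the current view."""
    state_x = start_x
    history = []
    for dx, cam_x in zip(deltas, cam_centers):
        in_view = abs(state_x - cam_x) <= half_width
        if persist or in_view:
            state_x += dx
        visible = abs(state_x - cam_x) <= half_width
        history.append((state_x, visible))
    return history

# Hypothetical scenario: the object moves right one unit per step while the
# camera pans left, returns, and then follows to the right.
deltas = [1.0, 1.0, 1.0, 1.0]
cams = [-4.0, -4.0, 0.0, 3.0]

with_state = rollout(0.0, deltas, cams, persist=True)
without_state = rollout(0.0, deltas, cams, persist=False)
```

In this toy run the persistent state reaches x = 4.0 and re-enters view there, while the view-dependent state lags at x = 2.0 because it missed the displacement applied during the off-camera steps.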

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modular injection could be tried for other object actions such as rotation or scaling by extending the trajectory representation.
State persistence across camera excursions suggests these models could support multi-step planning tasks that require remembering object locations outside the current view.
Because the adaptation targets only the spatial pathway, similar LoRA-based extensions might add other control signals without retraining the entire model.
The separation of world-space motion from screen-space displacement could be tested in settings where camera motion is more erratic than the training distribution.

Load-bearing premise

The pretrained video world model already contains a usable spatial-control pathway that can accept the added world-trajectory signal via SP-LoRA without degrading its camera navigation performance.
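The premise can be read against how a standard low-rank adapter behaves. The sketch below follows the generic LoRA recipe (frozen weight plus a scaled product of two small matrices, with one factor initialized to zero); which layers SP-LoRA targets and how it is trained are not given in the source, so the class `LoRALinear`, its parameters, and the toy weights are illustrative assumptions only.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * B @ A."""

    def __init__(self, W, rank, scale=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # pretrained weight, never modified
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: no change at start
        self.scale = scale

    def __call__(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # hypothetical frozen layer
layer = LoRALinear(W, rank=1)
x = [1.0, 1.0, 1.0]

y_init = layer(x)        # identical to the frozen layer's output
layer.B[0][0] = 0.5      # stand-in for a training update to the adapter
y_trained = layer(x)     # first output shifts, second is untouched
```

At initialization the adapted layer reproduces the frozen layer exactly, so any change in camera behaviour can only come from what training writes into the adapter. That is why the preservation claim has to be measured after training rather than assumed from the architecture.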

What would settle it

A direct comparison showing that camera-only navigation quality drops after SP-LoRA training, or that generated frames fail to place the selected object at the positions dictated by the user-drawn world trajectory.
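One simple way to quantify the second failure mode is the mean distance between where the object was asked to go and where a tracker finds it in the generated frames. The sketch below is a generic metric, not the paper's evaluation protocol (which is not described in the source); the function name and the sample coordinates are hypothetical.

```python
import math

def mean_trajectory_error(tracked, target):
    """Mean Euclidean distance between matched 2D points of two trajectories."""
    if len(tracked) != len(target) or not tracked:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(tracked, target)) / len(tracked)

# Hypothetical per-frame positions in normalized image coordinates.
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
tracked = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.4)]

err = mean_trajectory_error(tracked, target)
```

The same function, applied to camera-only rollouts before and after adapter training against a shared reference, would give one possible before-versus-after comparison for the first failure mode.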

Figures

Figures reproduced from arXiv: 2605.25077 by Alan Zhao, Bohai Gu, Dazhao Du, Jian Liu, Jie Zhang, Jinxiang Lai, Shuai Yang, Song Guo, Taiyi Wu, Xiaocheng Lu, Xiaotong Zhao, Yueyang Yuan.

**Figure 1.** WorldCraft overview. (Top-left) WorldCraft lifts a user-specified 2D trajectory into a camera-decoupled normalized world space and re-projects it into per-frame trajectory conditions under the given camera actions. (Top-right) The trajectory and camera controls are injected through a lightweight pathway-selective LoRA on the spatial-control pathway, while the backbone attention and MLP layers remain frozen…

**Figure 2.** Qualitative comparison of trajectory control. WorldCraft achieves precise and composable controllability, jointly controlling camera motion and target-object trajectories. Panel labels: Ours, WorldPlay, Matrix-game 2.0 (arrow and W/A/S/D key glyphs accompany the panels).

**Figure 3.** Long-horizon comparisons with off-camera motion. Given the same initial frame and camera actions, the goose moves right while the camera pans left and then returns. WorldCraft maintains scene consistency and, via TASP, recovers the goose at the correct off-camera-updated position when it re-enters view, whereas baselines either lose scene consistency or cannot track the off-camera object state.

**Figure 4.** Extended capabilities. Part: part-level control, in which the shield follows the trajectory while the body stays still. Multi: multi-object control, with three objects steered simultaneously along independent trajectories. Long: 253-frame autoregressive rollout with a long trajectory (∼10.5 s at 24 fps).

**Figure 5.** Curated training set statistics (N=27,027 clips after filtering). (a) Representative samples with the first frame, the SAM2 mask contour of the selected subject (blue), and the multi-point trajectory overlay (start in green, end in red, path in yellow). Subjects range from vehicles and pedestrians to pushed or carried objects under diverse weather and lighting. (b) Distribution of object displacement mag…

**Figure 6.** Automatic data curation pipeline. Given unlabeled video, we extract camera parameters, …

**Figure 7.** Activation-level analysis of camera-trajectory interaction.

**Figure 8.** Shared control subspace. 2D PCA projection of token-level camera-control updates u (blue) and trajectory-control updates v (red) at the peak block. The two distributions are aligned along the same principal directions rather than forming orthogonal subspaces, indicating that trajectory control is injected within the camera-compatible spatial-control subspace.
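The alignment described in the Figure 8 caption can be checked numerically by comparing the principal axes of the two point clouds. The sketch below is a toy version that works directly on 2D points and uses the closed-form major-axis angle of a 2×2 covariance matrix; the paper's analysis projects high-dimensional token updates first, and the data, the function names, and the alignment score here are assumptions for illustration.

```python
import math

def principal_axis(points):
    """Unit vector along the direction of largest variance of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the major axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def alignment(points_a, points_b):
    """Absolute cosine between principal axes: 1 = aligned, 0 = orthogonal."""
    a, b = principal_axis(points_a), principal_axis(points_b)
    return abs(a[0] * b[0] + a[1] * b[1])

# Hypothetical projected updates: u and v spread along the same diagonal,
# w spreads along the perpendicular diagonal.
u = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]
v = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)]
w = [(-1.0, 1.0), (0.0, 0.0), (1.0, -1.0)]

shared = alignment(u, v)      # close to 1
orthogonal = alignment(v, w)  # close to 0
```

A score near 1 corresponds to the "same principal directions" case in the caption, and a score near 0 to the orthogonal-subspace alternative it rules out.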

Original abstract

Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: users can move the camera, but cannot act on individual objects. Since real-world interaction is inherently object-centric, such models remain closer to passive scene observers than truly manipulable environments. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object-level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory-centric control pipeline: First, Normalized World Trajectory (NWT) represents user-drawn motion in a camera-invariant world coordinate system and dynamically re-projects it under the current camera pose, separating object motion from camera-induced screen-space displacement; Spatial-Pathway LoRA (SP-LoRA) then injects this world-space signal through the model's spatial-control pathway, adding object manipulation capability while preserving the pretrained camera controller; finally, Trajectory-Anchored State Persistence (TASP) treats the world trajectory as a persistent spatial state and refreshes autoregressive memory after trajectory-conditioned generation, allowing moved objects to reappear at their updated positions after leaving the camera view. Experiments show that WorldCraft enables accurate object control, preserves the video-based world model's camera fidelity under camera-only evaluation, and maintains object state across long autoregressive rollouts with off-camera excursions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

WorldCraft outlines a three-part pipeline to add object trajectories to camera-only video world models, but the abstract gives no evidence that SP-LoRA leaves camera performance untouched.

The main takeaway is that this paper tries to move video world models from camera navigation to object-level actions by representing user paths in world coordinates, adapting via LoRA on the spatial pathway, and anchoring state for off-camera persistence.

The combination of NWT, SP-LoRA, and TASP looks like a fresh way to handle the separation of object motion from camera motion, and the framing of the gap is clear. Treating the trajectory as persistent spatial state is a practical step for autoregressive rollouts where objects leave the frame.

The work is straightforward in identifying that current models are still passive observers and in proposing a control signal that stays camera-invariant. That part is useful for anyone thinking about simulation or robotics applications.

The soft spot is exactly the one the stress test flags: the claim that camera fidelity is preserved rests on the unshown assumption that the LoRA update does not touch or shift the original camera controller. The abstract asserts preservation under camera-only evaluation but supplies no ablation, no before-and-after metrics, and no equations to check the separation. Without those, the central preservation result cannot be assessed.

This is aimed at groups already working on interactive video models who want to test object-centric extensions. A reader looking for concrete methods and results would get value once the full experiments are visible.

I would send it to peer review so the experiments, ablations, and any interference effects can be checked directly rather than desk-rejecting on the abstract alone.

Referee Report

2 major / 1 minor

Summary. The paper introduces WorldCraft, a framework that extends pretrained video-based world models from camera navigation to object-level manipulation. It proposes three components: Normalized World Trajectory (NWT) to represent user-specified object paths in a camera-invariant world coordinate system that is dynamically re-projected; Spatial-Pathway LoRA (SP-LoRA) to inject the trajectory signal into the model's spatial-control pathway while leaving the camera controller intact; and Trajectory-Anchored State Persistence (TASP) to treat the world trajectory as persistent state and refresh autoregressive memory so that moved objects reappear correctly after off-camera excursions. Experiments are claimed to demonstrate accurate object control, preserved camera fidelity under camera-only evaluation, and stable object state over long rollouts.

Significance. If the three components deliver the stated outcomes without post-hoc tuning or hidden degradation of the base model, the work would meaningfully advance interactive video world models toward object-centric control, addressing a clear gap between current camera-only navigation and real-world manipulation needs. The explicit separation of object motion from camera motion via NWT and the state-persistence mechanism are conceptually clean contributions that could be adopted more broadly if the preservation property is rigorously shown.

major comments (2)

[Abstract] The central claim that SP-LoRA 'preserves the pretrained camera controller' and that 'camera fidelity' is maintained under camera-only evaluation rests on the unverified assumption that the LoRA update applied to the spatial pathway does not alter weights or activations used by the camera-navigation pathway. No ablation isolating camera-only performance before versus after SP-LoRA training is described, leaving open the possibility that shared parameters or representation shifts degrade camera control even when object control succeeds.
[Abstract, experiments paragraph] The three performance claims (accurate object control, preserved camera fidelity, maintained object state across long rollouts) are asserted without reference to quantitative metrics, baselines, or ablation tables that would allow verification that the outcomes are not the result of post-hoc tuning or selective evaluation. This makes it impossible to assess whether the method components actually deliver the stated results.

minor comments (1)

[Abstract] The abstract introduces several new acronyms (NWT, SP-LoRA, TASP) without a brief parenthetical expansion on first use, which reduces immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our submission. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.

Referee: [Abstract] The central claim that SP-LoRA 'preserves the pretrained camera controller' and that 'camera fidelity' is maintained under camera-only evaluation rests on the unverified assumption that the LoRA update applied to the spatial pathway does not alter weights or activations used by the camera-navigation pathway. No ablation isolating camera-only performance before versus after SP-LoRA training is described, leaving open the possibility that shared parameters or representation shifts degrade camera control even when object control succeeds.

Authors: We acknowledge the concern. The manuscript reports camera-only evaluation results after SP-LoRA training to demonstrate preserved fidelity, but does not include an explicit before-versus-after ablation on the same metrics. We will add this ablation (comparing camera navigation performance pre- and post-training) to the Experiments section and update the abstract to reference it, ensuring the preservation claim is rigorously supported. Revision: yes.

Referee: [Abstract, experiments paragraph] The three performance claims (accurate object control, preserved camera fidelity, maintained object state across long rollouts) are asserted without reference to quantitative metrics, baselines, or ablation tables that would allow verification that the outcomes are not the result of post-hoc tuning or selective evaluation. This makes it impossible to assess whether the method components actually deliver the stated results.

Authors: The full manuscript's Experiments section contains the supporting quantitative metrics, baselines, and ablation studies for the three claims. To address the abstract's lack of explicit references, we will revise the abstract to cite the specific tables and figures (e.g., object control accuracy in Table 2, camera fidelity in Figure 4, long-rollout state persistence in Table 3) so readers can directly locate the verification. Revision: yes.

Circularity Check

0 steps flagged

No circularity: method components are introduced by definition with no fitted predictions or self-citation reductions

The paper defines three new components (Normalized World Trajectory, Spatial-Pathway LoRA, Trajectory-Anchored State Persistence) as a control pipeline to extend camera navigation to object manipulation. These are presented as engineering choices rather than derived predictions. No equations appear that reduce a claimed result to a fitted parameter or prior self-citation; the abstract and method description contain no quantitative fits, uniqueness theorems, or ansatzes justified by author overlap. The central claims rest on the explicit definitions of the pipeline plus experimental evaluation, remaining self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no free parameters, axioms, or invented entities can be extracted beyond the three named components, which are presented as engineering contributions rather than new physical postulates.

pith-pipeline@v0.9.1-grok · 5839 in / 990 out tokens · 29002 ms · 2026-06-30T12:00:11.127211+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 8 canonical work pages · 6 internal anchors

[1] GameGen-X Authors. Gamegen-x: Interactive open-world game video generation. arXiv preprint arXiv:2411.00769, 2024.

[2] Google DeepMind. Genie 3: A large-scale foundation world model. Technical report, DeepMind, 2024.

[3] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

[4] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.

[5] Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025.

[6] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.

[7] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.

[8] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025.

[9] Ziqi Huang et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint, 2024.

[10] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

[11] Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. volume 2, page 6, 2025.

[12] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, 2024.

[13] Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025.

[14] William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

[15] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. In International Confe... 2025.

[16] Tencent Hunyuan. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[17] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991.

[18] Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. In International Conference on Learning Representations (ICLR), 2025.

[19] Wan-Move Authors. Wan-move: Wan move anything. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

[20] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision (ECCV), pages 55–72. Springer, 2024.

[21] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

[22] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH, 2024.

[23] WorldPlay Team. Worldplay: Interactive video generation with autoregressive world models. 2025. Tencent Hunyuan.

[24] Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. In Conference on robot learning (CoRL), pages 2226–2240. PMLR, 2023.

[25] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

[26] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[27] Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In International Conference on Learning Representations (ICLR), 2024. Outstanding Paper Award.

[28] Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.

[29] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 22963–22974, 2025.

[30] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
