TrajectoryMover: Generative Movement of Object Trajectories in Videos

Christopher E. Peters; Chun-Hao Paul Huang; Hyeonho Jeong; Kiran Chhatre; Paul Guerrero; Yulia Gryaditskaya

arxiv: 2603.29092 · v3 · pith:BHHOWUIAnew · submitted 2026-03-31 · 💻 cs.CV

Kiran Chhatre, Hyeonho Jeong, Yulia Gryaditskaya, Christopher E. Peters, Chun-Hao Paul Huang, Paul Guerrero

Pith reviewed 2026-05-21 11:00 UTC · model grok-4.3

keywords: video editing, generative video, object trajectory, synthetic paired data, 3D motion, trajectory manipulation, video generator

The pith

Synthetic paired videos from a new pipeline let a fine-tuned model move an object's 3D trajectory in real videos while preserving relative motion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to supply a missing capability: moving an object's 3D motion trajectory inside an existing video without breaking plausibility or identity. It does this by creating TrajectoryAtlas, a pipeline that produces large-scale synthetic paired video data, and then fine-tuning a generator called TrajectoryMover on that data. Previous approaches built paired data by constructing one video of a pair from the other, starting from unpaired collections, which breaks down when the desired pair cannot be derived that way. A sympathetic reader would care because the method opens a new class of intuitive edits that change how objects travel through a scene in three dimensions.

Core claim

We introduce TrajectoryAtlas, a new data generation pipeline for large-scale synthetic paired video data and a video generator TrajectoryMover fine-tuned with this data. We show that this successfully enables generative movement of object trajectories.

What carries the argument

TrajectoryAtlas pipeline that renders synthetic videos with controlled object trajectories to produce paired training examples, used to fine-tune TrajectoryMover for trajectory editing.
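
To make the idea of a paired training example concrete, here is a minimal sketch; it is not the TrajectoryAtlas implementation. The figure captions on this page mention PyBullet for physics, but this sketch swaps in a hand-written semi-implicit Euler integrator for a falling point so that it stays self-contained. The names `simulate_drop` and `make_pair`, the time step, the gravity constant, and the floor height are all hypothetical choices made for illustration. A pair here is the same motion simulated once from a source start and once from a target start.

```python
from typing import Dict, List, Tuple

Point3 = Tuple[float, float, float]


def simulate_drop(start: Point3, velocity: Point3, steps: int,
                  dt: float = 0.125, gravity: float = -8.0,
                  floor_z: float = 0.0) -> List[Point3]:
    """Integrate a falling point with semi-implicit Euler; it rests on the floor.

    All constants are illustrative assumptions, not values from the paper.
    """
    x, y, z = start
    vx, vy, vz = velocity
    path = [(x, y, z)]
    for _ in range(steps):
        vz += gravity * dt                      # update velocity first
        x, y, z = x + vx * dt, y + vy * dt, z + vz * dt
        if z <= floor_z:                        # crude floor contact: stop dead
            z, vx, vy, vz = floor_z, 0.0, 0.0, 0.0
        path.append((x, y, z))
    return path


def make_pair(source_start: Point3, target_start: Point3,
              velocity: Point3, steps: int) -> Dict[str, List[Point3]]:
    """Simulate the same motion from two start points to form one pair."""
    return {
        "source": simulate_drop(source_start, velocity, steps),
        "target": simulate_drop(target_start, velocity, steps),
    }


pair = make_pair((0.0, 0.0, 2.0), (1.0, 1.0, 2.0), (0.5, 0.0, 0.0), steps=10)
```

When the two starts share the same height, as above, the target path is an exact translate of the source path. With different start heights the floor contact happens at different frames in this toy simulation, which is one reason a pure translation of the source motion is not physically plausible in general.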

If this is right

Generative movement of an object's 3D motion trajectory becomes possible while keeping the video plausible and the object's identity intact.
Intuitive editing operations for short video clips are now available for changing object paths in 3D.
The approach bypasses the construction failure that occurs when one video in a pair cannot be derived from the other.
Large-scale paired data is supplied specifically for the trajectory-moving task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same synthetic-pair strategy could be tested on other video edits that require precise control over scene geometry.
If the model works on longer clips, it might support motion retargeting in film or VR post-production.
Combining trajectory editing with existing appearance-editing methods would create fuller video manipulation suites.

Load-bearing premise

The synthetic paired videos produced by TrajectoryAtlas are realistic and diverse enough for the fine-tuned TrajectoryMover to generalize to real-world video inputs.

What would settle it

If TrajectoryMover applied to held-out real videos generates trajectories that violate scene geometry or object identity, the claim that synthetic pairs suffice for generalization would be falsified.

Figures

Figures reproduced from arXiv: 2603.29092 by Christopher E. Peters, Chun-Hao Paul Huang, Hyeonho Jeong, Kiran Chhatre, Paul Guerrero, Yulia Gryaditskaya.

**Figure 1.** TrajectoryMover enables intuitive video editing by allowing users to translate an object’s 3D motion path to a new starting location using simple bounding box controls across diverse and complex scenarios, including drop, roll, and drag motions. Our model successfully aligns the generated trajectory with the target initial location. Furthermore, the model dynamically adapts the motion to the new path to en…

**Figure 2.** TrajectoryAtlas data generation pipeline. The pipeline has five stages: Asset Cache Preparation, Preflight Validation, Collision Aware Sampling and Scaling, Task Simulation, and Canonical Rendering with Runtime Metadata. Inputs including camera, 3D scene, lights and materials, and Objaverse or primitive assets are converted to reusable collision caches, then skip render preflight selects valid frames. Pair…

**Figure 3.** TrajectoryMover architecture. We concatenate three latent streams z_trj, z_src, and z_bb before denoising. In the control image, red marks the source box and green marks the target box. Data generation. TrajectoryAtlas uses Blender (Cycles) for rendering and PyBullet for physics. We use curated Evermotion [13] indoor scenes and a foreground object pool of 119 assets, with 98 Objaverse objects [11] and 21 p…

**Figure 4.** Qualitative comparison with baselines. We compare TrajectoryMover with SFM, ATI, DaS, VACE, and I2VEdit on four representative motion scenarios. Red boxes indicate the source object location in the input video, green boxes indicate the target location at frame 0, pink boxes highlight regions of failure, and cyan boxes highlight regions of success. TrajectoryMover follows the intended motion most consistent…

**Figure 5.** Qualitative ablation analysis. We compare the full model with ablations using only primitives, only scene modification, without scene modification, and drop-only motion training. Red boxes indicate the source object location, green boxes indicate the target frame-0 location, pink boxes mark representative regions of failure, and cyan boxes highlight regions of success. The full model gives the best bal…
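
The Figure 3 caption describes a control image in which red marks the source box and green marks the target box. The sketch below shows one way such an image could be rasterised, purely as an illustration. The image size, the choice of one-pixel outlines rather than filled boxes, the black background, and the names `draw_box` and `make_control_image` are all assumptions; the caption specifies none of them.

```python
from typing import List, Tuple

Color = Tuple[int, int, int]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixels
Image = List[List[Color]]        # image[y][x] -> RGB triple

RED: Color = (255, 0, 0)     # source box colour, as stated in the caption
GREEN: Color = (0, 255, 0)   # target box colour, as stated in the caption
BLACK: Color = (0, 0, 0)     # background colour is an assumption


def draw_box(image: Image, box: Box, color: Color) -> None:
    """Draw a one-pixel rectangle outline in place (outline style is assumed)."""
    x0, y0, x1, y1 = box
    for x in range(x0, x1 + 1):      # top and bottom edges
        image[y0][x] = color
        image[y1][x] = color
    for y in range(y0, y1 + 1):      # left and right edges
        image[y][x0] = color
        image[y][x1] = color


def make_control_image(width: int, height: int,
                       source_box: Box, target_box: Box) -> Image:
    """Return a blank image with the source box in red and target box in green."""
    image = [[BLACK for _ in range(width)] for _ in range(height)]
    draw_box(image, source_box, RED)
    draw_box(image, target_box, GREEN)
    return image


control = make_control_image(16, 12,
                             source_box=(1, 1, 5, 4),
                             target_box=(8, 6, 13, 10))
```

If the two boxes overlap, the target outline is drawn last and therefore wins at shared pixels; that ordering is a choice of this sketch and is not something the caption specifies.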

Original abstract

Generative video editing has enabled several intuitive editing operations for short video clips that would previously have been difficult to achieve, especially for non-expert editors. Existing methods focus on prescribing an object's 3D or 2D motion trajectory in a video, or on altering the appearance of an object or a scene, while preserving both the video's plausibility and identity. Yet a method to move an object's 3D motion trajectory in a video, i.e., moving an object while preserving its relative 3D motion, is currently still missing. The main challenge lies in obtaining paired video data for this scenario. Previous methods typically rely on clever data generation approaches to construct plausible paired data from unpaired videos, but this approach fails if one of the videos in a pair can not easily be constructed from the other. Instead, we introduce TrajectoryAtlas, a new data generation pipeline for large-scale synthetic paired video data and a video generator TrajectoryMover fine-tuned with this data. We show that this successfully enables generative movement of object trajectories. Project page: https://chhatrekiran.github.io/trajectorymover
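
The abstract's notion of "moving an object while preserving its relative 3D motion" can be illustrated geometrically. The sketch below rigidly translates a list of 3D points so that the path begins at a new location, and it checks that every displacement relative to the first frame is unchanged. It is only an illustration of the geometric idea: the paper's model is a generative video model, and the Figure 1 caption indicates that it also adapts the motion to the new path, which this sketch does not attempt. The function names and the sample coordinates are hypothetical.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def move_trajectory(trajectory: List[Point3], new_start: Point3) -> List[Point3]:
    """Rigidly translate a 3D trajectory so that it begins at new_start."""
    if not trajectory:
        return []
    x0, y0, z0 = trajectory[0]
    dx, dy, dz = new_start[0] - x0, new_start[1] - y0, new_start[2] - z0
    return [(x + dx, y + dy, z + dz) for x, y, z in trajectory]


def relative_motion(trajectory: List[Point3]) -> List[Point3]:
    """Displacement of every frame relative to the first frame."""
    x0, y0, z0 = trajectory[0]
    return [(x - x0, y - y0, z - z0) for x, y, z in trajectory]


# Hypothetical three-frame path of a dropped object (arbitrary units).
source = [(0.0, 0.0, 2.0), (0.5, 0.0, 1.5), (1.0, 0.0, 0.5)]
moved = move_trajectory(source, new_start=(3.0, 1.0, 2.0))
```

Here `relative_motion(moved)` equals `relative_motion(source)`, while `moved[0]` is the requested start point.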

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TrajectoryAtlas, a data generation pipeline for large-scale synthetic paired video data, and TrajectoryMover, a video generator fine-tuned on this data, to enable generative movement of an object's 3D motion trajectory in a video while preserving relative 3D motion, plausibility, and identity. The authors claim this paired-synthetic approach succeeds where prior unpaired-video construction methods fail.

Significance. If the central result holds, the work would address a clear gap in generative video editing by enabling trajectory manipulation that current methods cannot reliably perform. The synthetic paired-data route is a direct response to documented limitations of unpaired construction and could support downstream applications in intuitive video editing.

major comments (2)

[Abstract] The claim that the method 'successfully enables generative movement of object trajectories' is stated without any quantitative results, ablation studies, or real-video transfer metrics. This leaves the generalization from TrajectoryAtlas synthetic pairs to natural video inputs without visible empirical support.
[Abstract / Method] The load-bearing assumption that synthetic paired videos are realistic and diverse enough for fine-tuned TrajectoryMover to overcome failure modes of unpaired methods (lighting, occlusion statistics, camera motion, object interactions) is not accompanied by concrete evidence or metrics in the provided text. Without such validation the headline result does not yet follow.

minor comments (1)

[Abstract] The abstract mentions a project page but does not indicate whether code, models, or the TrajectoryAtlas generation scripts will be released.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and clarify the empirical support present in the full manuscript while indicating revisions to improve visibility of key results.

Referee: [Abstract] The claim that the method 'successfully enables generative movement of object trajectories' is stated without any quantitative results, ablation studies, or real-video transfer metrics. This leaves the generalization from TrajectoryAtlas synthetic pairs to natural video inputs without visible empirical support.

Authors: The abstract serves as a concise overview; the full manuscript reports quantitative results including trajectory accuracy metrics, perceptual quality scores, and direct comparisons against unpaired baselines on both synthetic and real videos. Ablation studies on data scale and diversity are also included, with explicit evaluation of generalization to natural inputs under varied conditions. We will revise the abstract to incorporate a brief reference to these supporting metrics.
Revision: yes

Referee: [Abstract / Method] The load-bearing assumption that synthetic paired videos are realistic and diverse enough for fine-tuned TrajectoryMover to overcome failure modes of unpaired methods (lighting, occlusion statistics, camera motion, object interactions) is not accompanied by concrete evidence or metrics in the provided text. Without such validation the headline result does not yet follow.

Authors: The manuscript presents targeted experiments on real videos that include diverse lighting, occlusion patterns, camera trajectories, and object interactions, with quantitative metrics demonstrating improved handling of these factors relative to unpaired approaches. We will add a concise summary of these validation results to the abstract and method section to make the supporting evidence more immediately visible.
Revision: yes

Circularity Check

0 steps flagged

No circularity: new synthetic data pipeline is self-contained

The paper introduces TrajectoryAtlas as a fresh data-generation pipeline for synthetic paired videos and fine-tunes TrajectoryMover on that data. No equations, fitted parameters, or derivation steps appear that reduce by construction to prior results or self-citations. The central claim rests on the empirical success of the newly generated paired data rather than re-labeling or re-deriving existing quantities, so the derivation chain is independent and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The contribution centers on two newly introduced components whose realism and generalization properties are not independently verified in the abstract.

invented entities (2)

TrajectoryAtlas · no independent evidence
purpose: Generate large-scale synthetic paired video data for trajectory editing
New pipeline created to solve paired-data scarcity that defeated earlier methods.

TrajectoryMover · no independent evidence
purpose: Video generator fine-tuned to perform generative object-trajectory movement
The model trained on the synthetic pairs to realize the editing operation.

pith-pipeline@v0.9.0 · 5744 in / 1186 out tokens · 62704 ms · 2026-05-21T11:00:30.102654+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 2 internal anchors

[1] AI, D.: Open-weight text-guided video editing (2025), https://platform.decart.ai/
[2] Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. ICCV (2025)
[3] Bradley, R.A., Terry, M.E.: Rank analysis of incomplete block designs I. The method of paired comparisons. Biometrika 39(3/4), 324–345 (1952)
[4] Burgert, R., Herrmann, C., Cole, F., Ryoo, M.S., Wadhwa, N., Voynov, A., Ruiz, N.: Motionv2v: Editing motion in a video. arXiv preprint arXiv:2511.20640 (2025)
[5] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...
[6] Chen, S., Guo, H., Zhu, S., Zhang, F., Huang, Z., Feng, J., Kang, B.: Video depth anything: Consistent depth estimation for super-long videos. CVPR (2025)
[7] Chen, W., Ji, Y., Wu, J., Wu, H., Xie, P., Li, J., Xia, X., Xiao, X., Lin, L.: Control-a-video: Controllable text-to-video diffusion models with motion prior and reward feedback learning. arXiv preprint arXiv:2305.13840 (2023)
[8] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR. pp. 24185–24198 (2024)
[9] Cong, Y., Xu, M., Simon, C., Chen, S., Ren, J., Xie, Y., Perez-Rua, J.M., Rosenhahn, B., Xiang, T., He, S.: Flatten: optical flow-guided attention for consistent text-to-video editing. ICLR (2023)
[10] Coumans, E., Bai, Y.: Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org (2016–2019)
[11] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. CVPR (2023)
[12] Deng, Y., Wang, R., Zhang, Y., Tai, Y.W., Tang, C.K.: Dragvideo: Interactive drag-style video editing. In: ECCV. pp. 183–199. Springer (2024)
[13] Evermotion: Evermotion. https://evermotion.org/, accessed: 2026-03-05
[14] Gu, Z., Yan, R., Lu, J., Li, P., Dou, Z., Si, C., Dong, Z., Liu, Q., Lin, C., Liu, Z., Wang, W., Liu, Y.: Diffusion as shader: 3d-aware video diffusion for versatile video generation control. ACM SIGGRAPH 2025 Conference Papers (2025)
[15] Gu, Z., Yan, R., Lu, J., Li, P., Dou, Z., Si, C., Dong, Z., Liu, Q., Lin, C., Liu, Z., et al.: Diffusion as shader: 3d-aware video diffusion for versatile video generation control. In: ACM SIGGRAPH Asia 2025 Conference Papers. pp. 1–12 (2025)
[16] Huang, Z., Jeong, H., Chen, X., Gryaditskaya, Y., Wang, T.Y., Lasenby, J., Huang, C.H.: Spacetimepilot: Generative rendering of dynamic scenes across space and time. arXiv preprint arXiv:2512.25075 (2025)
[17] Jeong, H., Lee, S., Ye, J.C.: Reangle-a-video: 4d video generation as video-to-video translation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11164–11175 (2025)
[18] Jeong, H., Ye, J.C.: Ground-a-video: Zero-shot grounded video editing using text-to-image diffusion models. ICLR (2023)
[19] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. ICCV (2025)
[20] Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., Siméoni, O., Vo, H.V., Labatut, P., Bojanowski, P.: Dinov2 meets text: A unified framework for image- and pixel-level vision-language alignment (2024)
[21] Ju, X., Wang, T., Zhou, Y., Zhang, H., Liu, Q., Zhao, N., Zhang, Z., Li, Y., Cai, Y., Liu, S., et al.: Editverse: Unifying image and video editing and generation with in-context learning. arXiv preprint arXiv:2509.20360 (2025)
[22] Koo, J., Guerrero, P., Huang, C.H.P., Ceylan, D., Sung, M.: Videohandles: Editing 3d object compositions in videos using video generative priors. In: CVPR (2025)
[23] Koroglu, M., Caselles-Dupré, H., Jeanneret, G., Cord, M.: Onlyflow: Optical flow based motion conditioning for video diffusion models. In: CVPR. pp. 6226–6236 (2025)
[24] Lee, Y.C., Zhang, Z., Huang, J., Wang, J.H., Lee, J.Y., Huang, J.B., Shechtman, E., Li, Z.: Generative video motion editing with 3d point tracks. arXiv preprint arXiv:2512.02015 (2025)
[25] Li, M., Chen, J., Zhao, S., Feng, W., Tu, P., He, Q.: Dreamstyle: A unified framework for video stylization. arXiv preprint arXiv:2601.02785 (2026)
[26] Liu, S., Wang, T., Wang, J.H., Liu, Q., Zhang, Z., Lee, J.Y., Li, Y., Yu, B., Lin, Z., Kim, S.Y., Jia, J.: Generative video propagation. CVPR (2025)
[27] Liu, Y., Wang, T., Liu, F., Wang, Z., Lau, R.W.: Shape-for-motion: Precise and consistent video editing with 3d proxy. ACM SIGGRAPH Asia 2025 Conference Papers (2025)
[28] Luce, R.D.: Individual Choice Behavior: A Theoretical Analysis. John Wiley & Sons, New York (1959)
[29] Maystre, L., Grossglauser, M.: Fast and accurate inference of plackett–luce models. In: Advances in Neural Information Processing Systems 28 (2015)
[30] Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual feat...
[31] Ouyang, W., Dong, Y., Yang, L., Si, J., Pan, X.: I2vedit: First-frame-guided video editing via image-to-video diffusion models. ACM SIGGRAPH Asia 2024 Conference Papers (2024)
[32] Park, B., Kim, B.H., Chung, H., Ye, J.C.: Redirector: Creating any-length video retakes with rotary camera encoding. arXiv preprint arXiv:2511.19827 (2025)
[33] Shi, X., Huang, Z., Wang, F.Y., Bian, W., Li, D., Zhang, Y., Zhang, M., Cheung, K.C., See, S., Qin, H., et al.: Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)
[34] Shin, J., Li, Z., Zhang, R., Zhu, J.Y., Park, J., Shechtman, E., Huang, X.: Motionstream: Real-time video generation with interactive motion controls. arXiv preprint arXiv:2511.01266 (2025)
[35] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)
[36] Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K., Lempitsky, V.: Resolution-robust large mask inpainting with fourier convolutions. WACV (2021)
[37] Teng, Y., Xie, E., Wu, Y., Han, H., Li, Z., Liu, X.: Drag-a-video: Non-rigid video editing with point-based interaction. arXiv preprint arXiv:2312.02936 (2023)
[38] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,... Wan: Open and Advanced Large-Scale Video Generative Models
[39] Wang, A., Huang, H., Fang, J.Z., Yang, Y., Ma, C.: Ati: Any trajectory instruction for controllable video generation. arXiv preprint arXiv:2505.22944 (2025)
[40] Ye, Z., Huang, H., Wang, X., Wan, P., Zhang, D., Luo, W.: Stylemaster: Stylize your video with artistic generation and translation. In: CVPR. pp. 2630–2640 (2025)
[41] Yu, M., Hu, W., Xing, J., Shan, Y.: Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 100–111 (2025)
[42] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV. pp. 3836–3847 (2023)

Supplementary Material. A Overview. This supplementary material includes two parts: (i) detailed baseline repurposing procedures (Sec. B), including the shared 3D trajectory extraction pipeline, method-specific ...
