
arxiv: 2607.00861 · v1 · pith:HJL6XPCInew · submitted 2026-07-01 · 💻 cs.CV · cs.GR

TrajLoc: Trajectory-Attention Localization for Multi-Object Motion Control

Pith reviewed 2026-07-02 14:19 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords multi-object trajectory control · image-to-video generation · attention localization · Gaussian heatmaps · object identity preservation · motion control · video synthesis

The pith

Substituting cross-attention weights with per-object Gaussian heatmaps isolates trajectories for multi-object video control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that multi-object motion control in image-to-video generation works better when each object's trajectory is enforced independently inside the attention layers rather than through entangled shared signals. A sympathetic reader would care because existing methods lose object identities and fail to follow distinct paths accurately once scenes become crowded or paths cross. The approach achieves isolation by replacing the cross-attention weights of each object token with a Gaussian heatmap centered on its target location at every frame. The same token interface also carries trajectory and depth information while first-frame appearance encodes identity. Tests on six datasets with up to twenty objects and two different backbones show consistent gains in visual quality and path accuracy.

Core claim

TrajLoc enforces strict per-object spatial constraints directly within the attention layers by substituting the cross-attention weights of each object token with a Gaussian heatmap centered on its target location at every frame. The same per-object token interface carries trajectory and depth through a learned embedding and preserves identity by encoding first-frame appearance in place of an object token.
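To make the substitution concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the attention weights are laid out as one row per spatial position and one column per text token, that the heatmap is normalized to sum to 1, and that the substitution overwrites the object token's column for a single frame. The names `gaussian_heatmap`, `substitute_object_column`, and the `sigma` value are hypothetical; the source does not specify any of them.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma):
    """Return a height x width grid of Gaussian weights peaked at (cx, cy), summing to 1."""
    raw = [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)] for y in range(height)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]

def substitute_object_column(attn, token_idx, heatmap):
    """Overwrite one token's cross-attention weights with a flattened heatmap.

    attn[p][k] is the weight that spatial position p (row-major) gives to token k.
    This layout is an assumption made for illustration.
    """
    flat = [v for row in heatmap for v in row]
    if len(flat) != len(attn):
        raise ValueError("heatmap size must match the number of spatial positions")
    return [row[:token_idx] + [flat[p]] + row[token_idx + 1:]
            for p, row in enumerate(attn)]

# Hypothetical toy setup: a 4x4 latent grid, 3 text tokens, object token at index 1.
H, W, NUM_TOKENS = 4, 4, 3
uniform_attn = [[1.0 / NUM_TOKENS] * NUM_TOKENS for _ in range(H * W)]
heat = gaussian_heatmap(H, W, cx=2, cy=1, sigma=1.0)
new_attn = substitute_object_column(uniform_attn, token_idx=1, heatmap=heat)

# The object token now attends most strongly at the target location (row 1, column 2).
peak_position = max(range(H * W), key=lambda p: new_attn[p][1])
```

In a full video model the same step would be repeated per frame with the target location taken from the trajectory; the other tokens' columns are left untouched here.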

What carries the argument

Substitution of cross-attention weights with per-object Gaussian heatmaps centered on target locations at every frame.

If this is right

  • Achieves average gains of +4.3 dB PSNR in visual fidelity across datasets.
  • Reduces trajectory end point error by 51 percent relative to strongest baselines.
  • Scales to scenes containing up to 20 simultaneously controlled objects.
  • Applies to two architecturally distinct video generation backbones.
  • Maintains improvements on out-of-distribution real-world scenes.
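For readers who want to relate these numbers to their definitions, here is a small sketch of the two metrics. It assumes the standard PSNR formula and treats end point error as the mean Euclidean distance between tracked and target positions; the paper's exact evaluation protocol is not given in the abstract, so both the function names and that reading of the error metric are assumptions.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two equally sized flat lists of pixel values."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def endpoint_error(predicted, target):
    """Mean Euclidean distance between predicted and target (x, y) points (assumed definition)."""
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(predicted, target)]
    return sum(dists) / len(dists)

def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline

# At a fixed peak value, a +4.3 dB PSNR gain corresponds to this MSE ratio.
mse_ratio_for_gain = 10.0 ** (-4.3 / 10.0)

# Toy values for illustration only; they are not results from the paper.
toy_psnr = psnr([0, 0, 0, 0], [10, 10, 10, 10])
toy_epe = endpoint_error([(0, 0), (3, 4)], [(0, 0), (0, 0)])
toy_reduction = relative_reduction(10.0, 4.9)
```

Read this way, a +4.3 dB gain means the mean squared error shrinks to roughly 37 percent of its baseline value, and a 51 percent reduction means the error falls to about half of the strongest baseline's.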

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The per-object token interface could support additional conditioning signals such as velocity or interaction rules without redesigning the attention structure.
  • The method may extend to tasks that require spatial localization in other generative domains such as image editing or 3D asset animation.
  • Overlapping heatmaps at intersection points could be monitored to detect and resolve potential identity swaps automatically.
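The last extension can be sketched as follows. This is an illustration of the editorial idea, not something the paper describes. It assumes both objects use isotropic Gaussians with the same `sigma` and measures overlap with the Bhattacharyya coefficient, which for that case reduces to exp(-d^2 / (8 sigma^2)) in the continuous domain; the threshold of 0.5 is an arbitrary placeholder.

```python
import math

def gaussian_overlap(center_a, center_b, sigma):
    """Bhattacharyya coefficient of two isotropic 2D Gaussians sharing one sigma."""
    d2 = (center_a[0] - center_b[0]) ** 2 + (center_a[1] - center_b[1]) ** 2
    return math.exp(-d2 / (8.0 * sigma ** 2))

def flag_overlap_frames(traj_a, traj_b, sigma, threshold=0.5):
    """Return frame indices where the two heatmaps overlap at or above the threshold."""
    return [t for t, (a, b) in enumerate(zip(traj_a, traj_b))
            if gaussian_overlap(a, b, sigma) >= threshold]

# Two hypothetical diagonal trajectories that cross at frame 2.
traj_a = [(0.0, 0.0), (2.0, 2.0), (4.0, 4.0), (6.0, 6.0), (8.0, 8.0)]
traj_b = [(8.0, 0.0), (6.0, 2.0), (4.0, 4.0), (2.0, 6.0), (0.0, 8.0)]
flagged = flag_overlap_frames(traj_a, traj_b, sigma=1.0)
```

Frames returned by `flag_overlap_frames` are where an identity check would be most informative, since that is where the two objects' spatial constraints are least distinguishable.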

Load-bearing premise

The per-object Gaussian heatmaps isolate instances and enforce spatial constraints independently without introducing artifacts or breaking coherent video synthesis when paths intersect or occlude.

What would settle it

Apply the method to a video scene where two object trajectories cross or one occludes the other and check whether object identities merge or video coherence visibly breaks.
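One way to score such a test is sketched below. It assumes the generated video has already been run through some off-the-shelf tracker that yields a final position per object, and it flags an object as swapped when its tracked end position lies closer to another object's target than to its own. The function names and the nearest-target criterion are assumptions made for illustration.

```python
import math

def nearest_target(track_point, target_points):
    """Index of the target point closest to a tracked point."""
    return min(range(len(target_points)),
               key=lambda i: math.dist(track_point, target_points[i]))

def identity_swaps(tracked_final, target_final):
    """Indices of objects whose tracked end position is nearest to a different object's target."""
    return [i for i, p in enumerate(tracked_final)
            if nearest_target(p, target_final) != i]

# Hypothetical end positions for two objects whose paths crossed mid-video.
target_final = [(8.0, 8.0), (0.0, 8.0)]
tracked_ok = [(7.9, 8.1), (0.2, 7.8)]       # each object reached its own target
tracked_swapped = [(0.5, 7.5), (7.6, 8.2)]  # the objects ended at each other's targets

swaps_ok = identity_swaps(tracked_ok, target_final)
swaps_bad = identity_swaps(tracked_swapped, target_final)
```

An empty result supports the premise for that scene; a non-empty one points to merged or exchanged identities. Visual coherence would still need to be judged separately.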

Figures

Figures reproduced from arXiv: 2607.00861 by Avi Ben-Cohen, Inbar Huberman-Spiegelglas, Michael Rotman, Omer Sela, Sagie Benaim.

Figure 1: TrajLoc. Given a first frame and a set of target trajectories (left column, with colored polylines), the goal is to generate a video that moves each object along its prescribed path while preserving its visual identity. Top: multiple pedestrians on a synthetic urban scene. Bottom: sheep in a natural outdoor scene. The remaining columns show three uniformly spaced generated frames with the ground-truth posi…
Figure 2: An overview of TrajLoc. A structured text prompt “Scene where o_0 moves [traj_0], o_1 moves [traj_1], ...” is constructed, where each o_i is the object’s given category name. The trajectory tokens [traj_i] are replaced with learned embeddings from the pretrained (frozen) Enc_traj, which independently encodes each target trajectory (x_i(t), y_i(t), d_i(t)). The learned appearance encoder Enc_app encodes each obj…
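The prompt template quoted in the Figure 2 caption can be assembled as in the sketch below. The helper name `build_structured_prompt` and the exact spelling of the placeholder tokens are assumptions; the caption only shows the general pattern, and in the described method the placeholders are later replaced by learned trajectory embeddings rather than kept as text.

```python
def build_structured_prompt(category_names):
    """Build 'Scene where <o0> moves [traj0], <o1> moves [traj1], ...' from category names."""
    clauses = [f"{name} moves [traj{i}]" for i, name in enumerate(category_names)]
    return "Scene where " + ", ".join(clauses)

# Hypothetical object categories for a three-object scene.
prompt = build_structured_prompt(["sheep", "sheep", "dog"])
print(prompt)
```

Indexing the placeholders by position keeps two objects of the same category distinguishable in the prompt even though their category names are identical.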
Figure 3: Trajectory autoencoder pretraining. The trajectory encoder maps each object trajectory τ_i(t) = (x_i(t), y_i(t), d_i(t)) and a temporal position channel to a token embedding ⟨traj_i⟩ in the text encoder space. The embedding passes through the frozen text encoder before a decoder reconstructs the original trajectory, ensuring the representation remains informative after text encoder processing.
Figure 4 (caption fragment): Each row shows three generated frames spanning the video (frames 9, 29, 49) where colored …
Figure 4: Qualitative comparison. Top-left: CogVideoX-5B results on DAVIS fish (5 objects, real-world). Top-right: WaN 2.1-14B results on MOT17 (4 pedestrians, real-world). Bottom-left: CogVideoX-5B results on MoVi-Extended (6 objects, synthetic). Bottom-right: WaN 2.1-14B results on MOTSynth (10 pedestrians, synthetic). Each row shows three generated frames from a different method, with ground-truth object position…
Original abstract

Controlling the motion of multiple objects in image-to-video (I2V) generation requires preserving object identities while enforcing adherence to distinct target trajectories. This becomes particularly challenging as the number of objects increases and their paths intersect or occlude one another. Existing approaches entangle multiple trajectories within a shared, dense conditioning signal, making object-level correspondence difficult to preserve in crowded scenes. We depart from this paradigm and enforce a strict, per-object spatial constraint that isolates instances independently. Our method, TrajLoc, achieves this directly within the attention layers by substituting the cross-attention weights of each object token with a Gaussian heatmap centered on its target location at every frame. The same per-object token interface carries trajectory and depth through a learned embedding and preserves identity by encoding first-frame appearance in place of an object token. Evaluations across six datasets, featuring up to 20 simultaneously controlled objects and out-of-distribution real-world scenes, demonstrate that our method consistently improves both visual fidelity and trajectory adherence. Applied to two architecturally distinct backbones (CogVideoX 5B and WaN 2.1 14B), our approach achieves average gains of +4.3 dB PSNR and a 51% reduction in trajectory end point error compared to the strongest baselines. Project page: https://sela-omer.github.io/traj-loc/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TrajLoc for multi-object motion control in image-to-video generation. It enforces per-object spatial constraints by substituting cross-attention weights of each object token with an independent Gaussian heatmap centered on the target location at every frame, while carrying trajectory and depth via a learned embedding and preserving identity through first-frame appearance encoding. The authors claim that this yields consistent gains in visual fidelity and trajectory adherence, with average improvements of +4.3 dB PSNR and 51% reduction in endpoint error versus strongest baselines when applied to CogVideoX 5B and WaN 2.1 14B across six datasets featuring up to 20 objects and out-of-distribution real-world scenes.

Significance. If the reported gains prove robust, the approach would provide a lightweight, backbone-agnostic mechanism for object isolation inside existing attention layers of large video diffusion models. The evaluation across two architecturally distinct backbones and on out-of-distribution scenes is a strength that supports broader applicability.

major comments (2)
  1. [Abstract] Abstract: The central claim concerns performance in crowded scenes where paths intersect or occlude, yet the reported aggregate metrics (+4.3 dB PSNR, 51% EPE reduction) provide no breakdown or separate results on the intersecting/occluding subset. This leaves the load-bearing assumption that independent Gaussian substitutions preserve coherence without identity leakage or artifacts untested by the presented evidence.
  2. [Method] Method description: The substitution of cross-attention weights with per-object Gaussians is presented as operating directly inside the model's attention layers, but no equations, pseudocode, or implementation details specify whether the replacement occurs before or after softmax, per attention head, or with cross-object normalization. This underspecification directly affects whether the joint attention computation can still model interactions when trajectories cross.
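The second major comment can be illustrated with a toy row of attention weights. The sketch assumes weights are normalized over tokens at each spatial position, and it compares three hypothetical ways of handling the row after an object token's weight is overwritten with a heatmap value. None of these is confirmed as the paper's choice; the helper names are invented for this example.

```python
def substitute_weight(row, token_idx, value):
    """Overwrite one token's weight in a single attention row."""
    out = list(row)
    out[token_idx] = value
    return out

def renormalize(row):
    """Rescale the whole row to sum to 1; the substituted value changes too."""
    total = sum(row)
    return [v / total for v in row]

def rescale_others(row, token_idx):
    """Keep the substituted value fixed and rescale the other tokens to fill the rest."""
    value = row[token_idx]
    others = sum(v for k, v in enumerate(row) if k != token_idx)
    scale = (1.0 - value) / others
    return [v if k == token_idx else v * scale for k, v in enumerate(row)]

# Post-softmax weights over 3 tokens at one spatial position (toy values).
row = [0.5, 0.3, 0.2]
overwritten = substitute_weight(row, token_idx=1, value=0.9)

row_sum_after_overwrite = sum(overwritten)   # no longer 1
fully_renormalized = renormalize(overwritten)
object_preserved = rescale_others(overwritten, token_idx=1)
```

The three variants give the object token weights of 0.9 (unnormalized row), 0.5625 (whole row renormalized), and 0.9 (other tokens squeezed), which is why the placement relative to softmax and the normalization rule affect how much attention remains for other tokens where trajectories cross.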
minor comments (2)
  1. The quantitative claims would be strengthened by reporting error bars, standard deviations, or per-dataset breakdowns rather than averages alone.
  2. Dataset statistics (number of sequences, resolution, trajectory generation procedure, and annotation protocol) are not described, which hinders assessment of the evaluation scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation breakdown and method specification. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim concerns performance in crowded scenes where paths intersect or occlude, yet the reported aggregate metrics (+4.3 dB PSNR, 51% EPE reduction) provide no breakdown or separate results on the intersecting/occluding subset. This leaves the load-bearing assumption that independent Gaussian substitutions preserve coherence without identity leakage or artifacts untested by the presented evidence.

    Authors: We agree that the central claim focuses on crowded scenes with intersections and occlusions, and that aggregate metrics alone leave this aspect under-tested. The six datasets do contain such cases (up to 20 objects), but no subset analysis is currently reported. In the revised manuscript we will add a dedicated breakdown of PSNR and endpoint error on the intersecting/occluding subset to directly evaluate coherence preservation. revision: yes

  2. Referee: [Method] Method description: The substitution of cross-attention weights with per-object Gaussians is presented as operating directly inside the model's attention layers, but no equations, pseudocode, or implementation details specify whether the replacement occurs before or after softmax, per attention head, or with cross-object normalization. This underspecification directly affects whether the joint attention computation can still model interactions when trajectories cross.

    Authors: The current manuscript does not provide these implementation specifics, which is an oversight in the method description. We will revise the method section to include explicit equations and pseudocode clarifying the substitution timing (relative to softmax), per-head application, and normalization procedure. This will make the interaction modeling behavior reproducible and address the concern about cross-trajectory coherence. revision: yes
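The subset breakdown promised in the first response could be computed along the lines of the sketch below. It assumes a sequence counts as "crossing" when any two objects come within a chosen radius of each other in the same frame, and that a per-sequence endpoint error is already available. The criterion, the radius, the field names, and the numbers are all placeholders for illustration, not values from the paper.

```python
import math
from statistics import mean

def has_close_pass(trajectories, radius):
    """True if any two objects come within `radius` of each other in the same frame."""
    n = len(trajectories)
    for i in range(n):
        for j in range(i + 1, n):
            if any(math.dist(a, b) <= radius
                   for a, b in zip(trajectories[i], trajectories[j])):
                return True
    return False

def breakdown(sequences, radius):
    """Average a per-sequence metric separately over crossing and non-crossing sequences."""
    groups = {"crossing": [], "non_crossing": []}
    for seq in sequences:
        key = "crossing" if has_close_pass(seq["trajectories"], radius) else "non_crossing"
        groups[key].append(seq["epe"])
    return {k: (mean(v) if v else None) for k, v in groups.items()}

# Made-up sequences with placeholder endpoint errors (not results from the paper).
sequences = [
    {"trajectories": [[(0, 0), (4, 4), (8, 8)], [(8, 0), (4, 4), (0, 8)]], "epe": 6.0},
    {"trajectories": [[(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]], "epe": 4.0},
    {"trajectories": [[(0, 0), (1, 0), (2, 0)], [(0, 9), (1, 9), (2, 9)]], "epe": 2.0},
]
result = breakdown(sequences, radius=1.5)
```

Reporting the two group means side by side, together with the group sizes, would show whether the aggregate gains hold on the intersecting subset that the referee singles out.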

Circularity Check

0 steps flagged

No circularity: architectural substitution applied to external backbones

full rationale

The paper presents TrajLoc as a direct per-object substitution of cross-attention weights by Gaussian heatmaps inside unmodified diffusion backbones (CogVideoX 5B, WaN 2.1 14B). No equations, fitted parameters, or predictions are shown that reduce the reported PSNR/EPE gains to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central mechanism and empirical results on six datasets stand as an independent architectural change evaluated against external baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; ledger entries are therefore minimal and provisional.

free parameters (1)
  • learned embedding for trajectory and depth
    Mentioned as the carrier for trajectory and depth information through the per-object token interface.
axioms (1)
  • domain assumption: Gaussian heatmap substitution isolates object instances and enforces spatial constraints without side effects on coherence
    Central to the method description in the abstract.

pith-pipeline@v0.9.1-grok · 5787 in / 1192 out tokens · 23425 ms · 2026-07-02T14:19:38.351814+00:00 · methodology

