
arxiv: 2607.01869 · v1 · pith:7RMDPCJKnew · submitted 2026-07-02 · 💻 cs.CV

QWERTY: Training-Free Motion Control via Query-Warped Video Diffusion Transformers

Pith reviewed 2026-07-03 16:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords motion control · video diffusion transformers · training-free · query warping · image-to-video · optical flow · attention manipulation

The pith

Warping the queries in video diffusion transformers enables training-free motion control comparable to fine-tuned models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that explicit motion control can be added to pretrained image-to-video diffusion transformers without any training or fine-tuning. It does this by warping the frame-invariant semantic subspace of the queries inside the model's 3D full attention, using user-defined object warping and optical flow as input. The resulting noise prediction then steers the diffusion process toward the desired motion, and feeding that noise back as self-guidance further stabilizes the output. A sympathetic reader would care because the approach avoids the data and compute costs of fine-tuning while keeping the original model's generative quality intact.
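As a rough illustration of the query-warping step described above, the following sketch copies the semantic channels of first-frame query tokens to their warped positions in later frames. The data layout, the function name `warp_queries`, and the choice of which channels count as "semantic" are assumptions made for this example; the paper's actual implementation is not shown on this page.

```python
from typing import Dict, List, Tuple

Pos = Tuple[int, int]   # (row, col) on the token grid
Token = List[float]     # one query vector with C channels


def warp_queries(
    queries: List[Dict[Pos, Token]],
    warps: Dict[int, Dict[Pos, Pos]],
    semantic: List[int],
) -> List[Dict[Pos, Token]]:
    """Paste semantic channels of frame-0 source tokens at warped positions.

    queries[f][pos] is the query token at grid position pos in frame f.
    warps[f] maps a source position in frame 0 to its target position in frame f.
    semantic lists the channel indices treated as frame-invariant (assumed given).
    """
    # Copy so the caller's queries are left untouched.
    out = [{pos: list(tok) for pos, tok in frame.items()} for frame in queries]
    for f, mapping in warps.items():
        for src, dst in mapping.items():
            if dst not in out[f]:
                continue  # warped position falls outside the token grid
            src_tok = queries[0][src]
            for c in semantic:
                out[f][dst][c] = src_tok[c]  # non-semantic channels keep frame f's values
    return out


# Hypothetical toy input: 2 frames, a 1x3 token grid, 3 channels.
frames = [
    {(0, 0): [1.0, 2.0, 3.0], (0, 1): [4.0, 5.0, 6.0], (0, 2): [7.0, 8.0, 9.0]},
    {(0, 0): [10.0, 11.0, 12.0], (0, 1): [13.0, 14.0, 15.0], (0, 2): [16.0, 17.0, 18.0]},
]
# Move the object token at (0, 0) in frame 0 to (0, 2) in frame 1.
warped = warp_queries(frames, {1: {(0, 0): (0, 2)}}, semantic=[0, 1])
```

In this toy case the target token in frame 1 takes channels 0 and 1 from the source token and keeps its own channel 2, which mirrors the idea of warping only the semantic subspace.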

Core claim

By warping the frame-invariant semantic subspace of queries in the 3D full attention of image-to-video DiTs, the noise predicted by the model naturally guides the diffusion trajectory toward the user-specified motion; leveraging this noise as self-guidance for latent optimization improves control stability and visual quality.

What carries the argument

Query warping applied to the frame-invariant semantic subspace inside the 3D full attention, which incorporates user-specified object warping and optical flow to redirect the model's noise prediction.
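To see why changing a query redirects attention, recall that each attention row is a softmax over keys computed from a single query. The toy below is a sketch under simplifying assumptions (one head, no positional encoding, hand-picked 2-D vectors, hypothetical names): replacing the target token's query with the source token's query makes the target attend where the source would.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(query, keys):
    """Attention weights of one query over all keys (scaled dot product)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


# Hypothetical keys: index 0 = object (source region), 1 = other content, 2 = background.
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
q_source = [4.0, 0.0]   # query of the object token in frame 1
q_target = [0.0, 4.0]   # original query at the warped position in frame n

before = attention_row(q_target, keys)  # target attends mostly to key 1
after = attention_row(q_source, keys)   # after query replacement: mostly key 0
```

Because the row depends only on the query, pasting the source query at the target position is enough to move the peak of that row onto the object key in this toy setting. Real DiT attention also involves rotary position embeddings, which this sketch ignores.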

If this is right

  • QWERTY achieves the most effective motion control among existing training-free approaches on a recent image-to-video DiT.
  • Performance reaches levels comparable to fine-tuning-based methods.
  • Using the predicted noise as self-guidance for latent optimization improves both control stability and visual quality.
  • The framework supports flexible motion control through arbitrary user-defined object warping and optical flow.
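The self-guidance idea in the list above can be pictured as a small optimization loop on the latent. The sketch below is only a toy: the linear "noise predictor", the squared-error objective, and all names and constants are assumptions for illustration, since the paper's actual loss and optimizer settings are not given on this page.

```python
def guidance_loss(latent, target, gain):
    """Squared distance between the toy prediction gain*z and a frozen target."""
    return sum((gain * z - t) ** 2 for z, t in zip(latent, target))


def optimize_latent(latent, target, gain, lr=0.1, steps=20):
    """Plain gradient descent on the latent; the target is held fixed."""
    z = list(latent)
    for _ in range(steps):
        # d/dz (gain*z - t)^2 = 2 * gain * (gain*z - t)
        grad = [2.0 * gain * (gain * zi - ti) for zi, ti in zip(z, target)]
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


gain = 1.0    # toy stand-in for the unwarped noise predictor
shift = 0.5   # toy stand-in for the change that query warping induces
z0 = [0.0, 1.0, -2.0]
target = [gain * z + shift for z in z0]   # frozen "query-warped" prediction
z_opt = optimize_latent(z0, target, gain)

loss_before = guidance_loss(z0, target, gain)
loss_after = guidance_loss(z_opt, target, gain)
```

The design point the toy captures is that the guidance signal comes from the model itself (its query-warped prediction) rather than from an external network or training data.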

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The query-warping technique might transfer to other transformer-based diffusion models for controlling attributes beyond motion.
  • If the warping step can be made faster, the method could support interactive video editing sessions.
  • Testing the approach on videos with multiple interacting objects would reveal whether the single-object warping assumption holds in more complex scenes.

Load-bearing premise

That warping the queries will make the predicted noise steer generation to the desired motion without breaking temporal coherence or overall visual quality.

What would settle it

A test video where the query-warped model produces motion that deviates from the supplied optical flow or object trajectories, or where visual quality drops below the unwarped baseline.
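One way to quantify "deviates from the supplied optical flow" is the mean endpoint error between the flow measured in the generated video and the flow the user supplied. The sketch below assumes both flows are already available as nested lists of (du, dv) pairs; how the generated video's flow is estimated, and which metric the paper actually reports, are not specified on this page.

```python
import math


def mean_endpoint_error(pred_flow, target_flow):
    """Average Euclidean distance between per-pixel flow vectors.

    Both arguments are H x W nested lists of (du, dv) tuples.
    """
    total, count = 0.0, 0
    for row_p, row_t in zip(pred_flow, target_flow):
        for (pu, pv), (tu, tv) in zip(row_p, row_t):
            total += math.hypot(pu - tu, pv - tv)
            count += 1
    if count == 0:
        raise ValueError("flow fields are empty")
    return total / count


# Hypothetical 1x2 flow fields.
measured = [[(3.0, 4.0), (0.0, 0.0)]]
requested = [[(0.0, 0.0), (0.0, 0.0)]]
error = mean_endpoint_error(measured, requested)
```

A large error on videos where the unwarped baseline stays visually intact would be the kind of evidence the test above describes.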

Figures

Figures reproduced from arXiv: 2607.01869 by Chanyoung Kim, Geunrip Park, Hyunkyung Han, Kyobin Choo, Seong Jae Hwang, Sunyoung Jung, Youngmin Kim.

Figure 1: We present Qwerty, a training-free framework that controls the motion of an image-to-video diffusion transformer (DiT) according to user-defined warping. By warping the queries of the DiT at inference time, Qwerty steers motion while preserving the generative fidelity of the pretrained backbone. Our framework enables (a) fine-grained object control by translating, rotating, and scaling masks drawn on the …
Figure 2: Illustration of the two key components in our training-free video motion control pipeline. (a) Our channel decomposition technique disentangles frame-inconsistent DiT query features into semantic and temporal subspaces. The filtered semantic channels are frame-consistent and spatially discriminative, making them amenable to explicit motion control. (b) We warp queries to manipulate the 3D full attention of …
Figure 3: A conceptual illustration of how query warping guides motion. Warping takes tokens from the object’s source region in frame 1 and pastes them at warped positions in frame n. Because the softmax is applied over keys, warping only the keys (a) or both queries and keys (b) does not establish a clear correspondence between the source and target regions. In (c), query warping concentrates attention from the tar…
Figure 4: Pipeline overview. (a) We perform diffusion steps using motion-inducing noise predicted from the query-warped video DiT. This query-warped noise is also used as a guidance signal to optimize the input latent, further stabilizing control. (b) We isolate frame-consistent semantic channels from the queries and warp them together with only the spatial axis of the 3D RoPE, delicately reshaping the attention dis…
Figure 5: Qualitative comparisons of local object motion control. Polygonal object region masks and user-defined warps are given as input. [Panel labels: First Frame, Last Frame, Original Video, QWERTY, MotionClone, MOFT, SG-I2V, Vanilla Wan, GWTF]
Figure 6: Qualitative comparisons of camera motion control. For evaluation, we use optical flow estimated from the original video. In practice, users can instead provide optical flow derived from a depth map and camera trajectory (Sec. 5.3).
… that use bounding-box trajectories to control object or camera motion. FreeTraj [36], originally designed for text-to-video (T2V), is adapted to the I2V setting and evaluated …
Figure 7: Failure of U-Net–based motion-control methods on DiT. (a) Noise Warping collapses generation. (b) SG-I2V yields mostly static results, with occasional unintended motion. In contrast, Qwerty achieves reliable object and camera control on DiT. [Panel label: Object control]
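The Figure 6 caption mentions optical flow derived from a depth map and a camera trajectory. A minimal sketch of that derivation for one camera step is given below, assuming a pinhole camera, a pure translation with no rotation, and forward flow from the first frame to the moved view. The function name, argument layout, and sign convention are assumptions for this example and may differ from what the paper does in Sec. 5.3.

```python
def flow_from_depth(depth, fx, fy, cx, cy, translation):
    """Per-pixel flow induced by translating a pinhole camera.

    depth: H x W nested list of positive depths along the optical axis.
    fx, fy, cx, cy: pinhole intrinsics in pixels.
    translation: (tx, ty, tz) of the new camera position in the old camera frame.
    Returns an H x W nested list of (du, dv), or None where the point ends up
    behind the moved camera.
    """
    tx, ty, tz = translation
    flow = []
    for v, row in enumerate(depth):
        flow_row = []
        for u, d in enumerate(row):
            # Back-project pixel (u, v) to a 3D point in the old camera frame.
            x = (u - cx) / fx * d
            y = (v - cy) / fy * d
            # Express the same point in the translated camera frame.
            xn, yn, zn = x - tx, y - ty, d - tz
            if zn <= 0:
                flow_row.append(None)
                continue
            # Re-project and take the pixel displacement.
            u2 = fx * xn / zn + cx
            v2 = fy * yn / zn + cy
            flow_row.append((u2 - u, v2 - v))
        flow.append(flow_row)
    return flow


# Hypothetical flat scene at depth 2, camera sliding 0.1 units to the right.
flat = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
flow = flow_from_depth(flat, fx=100.0, fy=100.0, cx=1.0, cy=0.5,
                       translation=(0.1, 0.0, 0.0))
```

For a fronto-parallel plane and a sideways translation, every pixel gets the same horizontal displacement of -fx * tx / depth, which is -5 pixels in the toy input.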

Video diffusion transformers (DiTs) generate high-fidelity and temporally coherent videos, yet motion control remains implicit, primarily relying on text prompts. As a result, achieving desired motion often requires extensive prompt engineering and repeated resampling. While fine-tuning models with additional spatial prompts (e.g., bounding boxes or point trajectories) enables explicit control, it demands substantial data curation and computation, and may compromise the generative capabilities of pretrained models. Consequently, training-free motion control using such spatial prompts has been explored in U-Net-based video diffusion models, but remains largely unexplored for DiTs. We introduce QWERTY, a training-free framework that enables flexible motion control in pretrained image-to-video DiTs via user-defined object warping and optical flow. We carefully manipulate the 3D full attention of DiTs by warping the frame-invariant semantic subspace of queries. We find that the noise predicted by the query-warped DiT naturally guides the diffusion trajectory toward the desired motion, and further show that leveraging this noise as self-guidance for latent optimization improves control stability and visual quality. Experiments show that QWERTY achieves the most effective motion control among existing training-free approaches on a recent image-to-video DiT, with performance comparable to fine-tuning-based methods.
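A user-defined object warp of the kind the abstract mentions (translation, rotation, and scaling of a drawn mask, per Figure 1) can be converted into a dense displacement field on the mask. The sketch below shows one way to do that; the function name, the (du, dv) convention, and the choice to rotate and scale about a user-given center are assumptions for illustration, not the paper's interface.

```python
import math


def affine_flow(mask, center, angle_deg, scale, shift):
    """Displacement field for rotating, scaling, then shifting a masked region.

    mask: H x W nested list of booleans (True inside the object).
    center: (cy, cx) pivot for rotation and scaling, in pixels.
    shift: (dy, dx) translation applied after rotation and scaling.
    Returns H x W nested list of (du, dv); zero outside the mask.
    """
    cy, cx = center
    dy, dx = shift
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    flow = []
    for y, row in enumerate(mask):
        out = []
        for x, inside in enumerate(row):
            if not inside:
                out.append((0.0, 0.0))
                continue
            rx, ry = x - cx, y - cy
            nx = cx + scale * (cos_a * rx - sin_a * ry) + dx
            ny = cy + scale * (sin_a * rx + cos_a * ry) + dy
            out.append((nx - x, ny - y))
        flow.append(out)
    return flow


# Hypothetical 2x3 mask with a two-pixel object, shifted 2 px right and 1 px down.
mask = [[False, True, True], [False, False, False]]
flow = affine_flow(mask, center=(0.0, 1.0), angle_deg=0.0, scale=1.0, shift=(1.0, 2.0))
```

Such a field gives, for every object pixel in the first frame, the position it should occupy in a later frame, which is the information a query-warping step needs.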

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces QWERTY, a training-free framework for explicit motion control in pretrained image-to-video Diffusion Transformers (DiTs). It manipulates the 3D full attention by warping the frame-invariant semantic subspace of queries according to user-specified object warping and optical flow. The central claim is that the resulting noise prediction naturally steers the diffusion trajectory toward the desired motion without additional training; the authors further propose using this noise as self-guidance for latent optimization to improve stability and quality. Experiments are reported to show that QWERTY outperforms existing training-free baselines on a recent image-to-video DiT while achieving performance comparable to fine-tuning-based methods.

Significance. If the empirical results hold under rigorous controls, the work would be significant for demonstrating an architecture-specific, training-free control mechanism that extends prior U-Net work to DiTs. It directly addresses the practical limitation of implicit motion control in high-fidelity video generators by leveraging the query subspace of 3D attention, potentially lowering the barrier to precise spatial control while preserving pretrained generative capabilities.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (method description): the claim that 'the noise predicted by the query-warped DiT naturally guides the diffusion trajectory' is presented as an empirical finding rather than a derived property. Without an explicit derivation or ablation isolating the contribution of query warping from other factors (e.g., the optical-flow input or the self-guidance step), it is difficult to assess whether the observed steering is a general consequence of the manipulation or specific to the chosen DiT and prompts.
  2. [Experiments] Experiments section (implied by abstract claims): the assertion of 'most effective motion control among existing training-free approaches' and 'performance comparable to fine-tuning-based methods' requires the full set of baselines, metrics, and statistical controls to be inspectable. If the evaluation uses post-hoc selection of prompts or omits variance across random seeds, the comparative claim would be weakened.
minor comments (2)
  1. [§3] Notation for the warped query subspace and the precise definition of 'frame-invariant semantic subspace' should be formalized with an equation to avoid ambiguity when reproducing the attention modification.
  2. [Abstract] The abstract would benefit from one or two concrete quantitative results (e.g., a specific metric improvement over the strongest baseline) to ground the superiority claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address the major comments point-by-point below.

  1. Referee: [Abstract, §3] Abstract and §3 (method description): the claim that 'the noise predicted by the query-warped DiT naturally guides the diffusion trajectory' is presented as an empirical finding rather than a derived property. Without an explicit derivation or ablation isolating the contribution of query warping from other factors (e.g., the optical-flow input or the self-guidance step), it is difficult to assess whether the observed steering is a general consequence of the manipulation or specific to the chosen DiT and prompts.

    Authors: We acknowledge that the steering effect is presented as an empirical observation ('We find that...') rather than a formal derivation. A complete theoretical derivation of how query warping in 3D attention alters the noise prediction would require extensive analysis of DiT attention dynamics, which lies beyond the paper's scope. To strengthen the claim, the revised manuscript will include new ablations that isolate query warping (comparing warped vs. non-warped queries while holding optical flow and self-guidance constant) across multiple DiT variants and prompt sets. These will clarify the contribution of the manipulation. revision: partial

  2. Referee: [Experiments] Experiments section (implied by abstract claims): the assertion of 'most effective motion control among existing training-free approaches' and 'performance comparable to fine-tuning-based methods' requires the full set of baselines, metrics, and statistical controls to be inspectable. If the evaluation uses post-hoc selection of prompts or omits variance across random seeds, the comparative claim would be weakened.

    Authors: The experiments section already includes all listed training-free and fine-tuning baselines evaluated with standard motion accuracy and perceptual quality metrics on a fixed prompt set (no post-hoc selection). Results are averaged over multiple random seeds with standard deviations reported. In the revision we will add an explicit supplementary table listing every baseline implementation detail, all metric definitions, the complete prompt list, and p-values from statistical significance tests to make the controls fully inspectable. revision: yes
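The kind of reporting discussed in this exchange (mean and standard deviation over seeds, plus a significance test) can be sketched with the standard library alone. The exact sign-flip permutation test below is one reasonable choice, not necessarily the test the authors use, and the scores are made-up placeholders rather than results from the paper.

```python
import itertools
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_permutation_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired per-seed scores."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Under the null, each paired difference is equally likely to flip sign.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder per-seed scores for two hypothetical methods (same seeds, same prompts).
method_a = [1, 2, 3, 4]
method_b = [0, 1, 2, 3]
mean_a, std_a = summarize(method_a)
p_value = paired_permutation_pvalue(method_a, method_b)
```

With only four seeds the smallest achievable two-sided p-value is 2/16 = 0.125, which illustrates why the number of seeds matters when a comparative claim rests on such a test.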

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces QWERTY as a direct manipulation of 3D full attention queries in pretrained DiTs, with the claim that query warping causes predicted noise to steer diffusion trajectories presented as an empirical observation ('We find that...') rather than a derived result. No equations, fitted parameters, or self-citations appear in the provided text that would reduce any 'prediction' or central mechanism to the inputs by construction. Experiments are described as external validation against baselines, with no load-bearing step that collapses to tautology or renaming. This is the common case of a self-contained empirical method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the core mechanism assumes the existence of a separable 'frame-invariant semantic subspace' within queries but provides no derivation or external grounding for this separation.
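To make the separability assumption concrete, the sketch below splits query channels into a frame-consistent group and a frame-varying group by thresholding each channel's variance across frames. This criterion, the threshold, and the names are assumptions chosen for illustration; the page does not describe how the paper actually identifies the semantic subspace.

```python
import statistics


def split_channels(features, threshold):
    """Split channel indices by how much they vary across frames.

    features[f][n][c]: value of channel c for token n in frame f, for tokens
    expected to show the same content in every frame (e.g. a static scene).
    Returns (semantic, temporal) lists of channel indices.
    """
    num_frames = len(features)
    num_tokens = len(features[0])
    num_channels = len(features[0][0])
    semantic, temporal = [], []
    for c in range(num_channels):
        # Average, over tokens, of the across-frame variance of this channel.
        var_sum = 0.0
        for n in range(num_tokens):
            var_sum += statistics.pvariance(
                [features[f][n][c] for f in range(num_frames)]
            )
        if var_sum / num_tokens <= threshold:
            semantic.append(c)
        else:
            temporal.append(c)
    return semantic, temporal


# Hypothetical features: 3 frames, 2 tokens, 2 channels.
feats = [
    [[1.0, 0.0], [2.0, 0.0]],
    [[1.0, 5.0], [2.0, -5.0]],
    [[1.0, 10.0], [2.0, -10.0]],
]
semantic, temporal = split_channels(feats, threshold=0.1)
```

In the toy input channel 0 is constant over time but differs between tokens, so it lands in the semantic group, while channel 1 drifts with the frame index and lands in the temporal group.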

pith-pipeline@v0.9.1-grok · 5772 in / 1149 out tokens · 19805 ms · 2026-07-03T16:17:39.769073+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 26 canonical work pages · 13 internal anchors

  1. [1]

    arXiv preprint arXiv:1412.6980 (2014)

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  2. [2]

    arXiv preprint arXiv:2412.07750 (2024)

    Atzmon, Y., Gal, R., Tewel, Y., Kasten, Y., Chechik, G.: Motion by queries: Identity-motion trade-offs in text-to-video generation. arXiv preprint arXiv:2412.07750 (2024)

  3. [3]

    In: Proceedings of the 33rd ACM International Conference on Multimedia

    Bai, J., He, T., Wang, Y., Guo, J., Hu, H., Liu, Z., Bian, J.: Uniedit: A unified tuning-free framework for video motion and appearance editing. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 10171–10180 (2025)

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  5. [5]

    Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

    Bochkovskii, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y., Richter, S.R., Koltun, V.: Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073 (2024)

  6. [6]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Burgert, R., Xu, Y., Xian, W., Pilarski, O., Clausen, P., He, M., Ma, L., Deng, Y., Li, L., Mousavi, M., et al.: Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13–23 (2025)

  7. [7]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al.: Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023)

  8. [8]

    arXiv preprint arXiv:2311.12886 (2023)

    Dai, Z., Zhang, Z., Yao, Y., Qiu, B., Zhu, S., Qin, L., Wang, W.: Animateanything: Fine-grained open domain image animation with motion guidance. arXiv preprint arXiv:2311.12886 (2023)

  9. [9]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Geng, D., Herrmann, C., Hur, J., Cole, F., Zhang, S., Pfaff, T., Lopez-Guevara, T., Aytar, Y., Rubinstein, M., Sun, C., et al.: Motion prompting: Controlling video generation with motion trajectories. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1–12 (2025)

  10. [10]

    arXiv preprint arXiv:2505.13344 (2025)

    Gokmen, A.B., Ekin, Y., Bilecen, B.B., Dundar, A.: Ropecraft: Training-free motion transfer with trajectory-guided rope optimization on diffusion transformers. arXiv preprint arXiv:2505.13344 (2025)

  11. [11]

    Google DeepMind: Veo 3: A generative video model by google deepmind. https://aistudio.google.com/models/veo-3 (2024), accessed: 2025-11-13

  12. [12]

    In: The Twelfth International Conference on Learning Representations (2024)

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In: The Twelfth International Conference on Learning Representations (2024)

  13. [13]

    In: The Thirteenth International Conference on Learning Representations (2025)

    He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: Enabling camera control for video diffusion models. In: The Thirteenth International Conference on Learning Representations (2025)

  14. [14]

    Advances in neural information processing systems 30 (2017)

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)

  15. [15]

    Hou, C., Chen, Z.: Training-free camera control for video generation. arXiv preprint arXiv:2406.10126 (2024)

  16. [16]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Huang, Y., Chen, Y., Ding, L., Zhang, X., Dai, W., Zou, J., Xiong, H., Tian, Q.: Im-zero: Instance-level motion controllable video generation in a zero-shot manner. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7265–7275 (2025)

  17. [17]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  18. [18]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Jain, Y., Nasery, A., Vineet, V., Behl, H.: Peekaboo: Interactive video generation via masked-diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8079–8088 (2024)

  19. [19]

    In: European conference on computer vision

    Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. In: European conference on computer vision. pp. 18–35. Springer (2024)

  20. [20]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  21. [21]

    arXiv preprint arXiv:2503.16421 (2025)

    Li, Q., Xing, Z., Wang, R., Zhang, H., Dai, Q., Wu, Z.: Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance. arXiv preprint arXiv:2503.16421 (2025)

  22. [22]

    arXiv preprint arXiv:2507.02857 (2025)

    Li, Z., Luo, H., Shuai, X., Ding, H.: Anyi2v: Animating any conditional image with motion control. arXiv preprint arXiv:2507.02857 (2025)

  23. [23]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

  24. [24]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22160–22169 (2024)

  25. [25]

    In: The Thirteenth International Conference on Learning Representations

    Ling, P., Bu, J., Zhang, P., Dong, X., Zang, Y., Wu, T., Chen, H., Wang, J., Jin, Y.: Motionclone: Training-free motion cloning for controllable video generation. In: The Thirteenth International Conference on Learning Representations

  26. [26]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  27. [27]

    Latte: Latent Diffusion Transformer for Video Generation

    Ma, X., Wang, Y., Jia, G., Chen, X., Liu, Z., Li, Y.F., Chen, C., Qiao, Y.: Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048 (2024)

  28. [28]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Miao, J., Wang, X., Wu, Y., Li, W., Zhang, X., Wei, Y., Yang, Y.: Large-scale video panoptic segmentation in the wild: A benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21033–21043 (2022)

  29. [29]

    arXiv preprint arXiv:2506.17220 (2025)

    Nam, J., Son, S., Chung, D., Kim, J., Jin, S., Hur, J., Kim, S.: Emergent temporal correspondences from video diffusion transformers. arXiv preprint arXiv:2506.17220 (2025)

  30. [30]

    arXiv preprint arXiv:2411.04989 (2024)

    Namekata, K., Bahmani, S., Wu, Z., Kant, Y., Gilitschenski, I., Lindell, D.B.: Sg-i2v: Self-guided trajectory control in image-to-video generation. arXiv preprint arXiv:2411.04989 (2024)

  31. [31]

    In: European Conference on Computer Vision

    Niu, M., Cun, X., Wang, X., Zhang, Y., Shan, Y., Zheng, Y.: Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. In: European Conference on Computer Vision. pp. 111–128. Springer (2024)

  32. [32]

    https://openai.com/index/video-generation-models-as-world-simulators/ (2024), accessed: 2025-11-13

    OpenAI: Sora: Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/ (2024), accessed: 2025-11-13

  33. [33]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Park, G.Y., Jeong, H., Lee, S.W., Ye, J.C.: Spectral motion alignment for video motion transfer using diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 6398–6405 (2025)

  34. [34]

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

  35. [35]

    Pondaven, A., Siarohin, A., Tulyakov, S., Torr, P., Pizzati, F.: Video motion transfer with diffusion transformers. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22911–22921 (2025)

  36. [36]

    Freetraj: Tuning-free trajectory control in video diffusion models. arXiv preprint arXiv:2406.16863 (2024)

    Qiu, H., Chen, Z., Wang, Z., He, Y., Xia, M., Liu, Z.: Freetraj: Tuning-free trajectory control in video diffusion models. arXiv preprint arXiv:2406.16863 (2024)

  37. [37]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)

  38. [38]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  39. [39]

    Advances in neural information processing systems 35, 36479–36494 (2022)

    Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35, 36479–36494 (2022)

  40. [40]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Shi, Q., Wu, J., Bai, J., Zhang, J., Qi, L., Tong, Y., Li, X.: Decouple and track: Benchmarking and improving video diffusion transformers for motion transfer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10995–11005 (2025)

  41. [41]

    In: ACM SIGGRAPH 2024 Conference Papers

    Shi, X., Huang, Z., Wang, F.Y., Bian, W., Li, D., Zhang, Y., Zhang, M., Cheung, K.C., See, S., Qin, H., et al.: Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  42. [42]

    Shi, Y., Xue, C., Liew, J.H., Pan, J., Yan, H., Zhang, W., Tan, V.Y., Bai, S.: Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8839–8849 (2024)

  43. [43]

    Neurocomputing 568, 127063 (2024)

    Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)

  44. [44]

    Advances in Neural Information Processing Systems 36, 1363–1389 (2023)

    Tang, L., Jia, M., Wang, Q., Phoo, C.P., Hariharan, B.: Emergent correspondence from image diffusion. Advances in Neural Information Processing Systems 36, 1363–1389 (2023)

  45. [45]

    In: European conference on computer vision

    Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: European conference on computer vision. pp. 402–419. Springer (2020)

  46. [46]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)

  47. [47]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  48. [48]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

  49. [49]

    arXiv preprint arXiv:2412.07721 (2024)

    Wang, Z., Lan, Y., Zhou, S., Loy, C.C.: Objctrl-2.5d: Training-free object control with camera poses. arXiv preprint arXiv:2412.07721 (2024)

  50. [50]

    In: ACM SIGGRAPH 2024 Conference Papers

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  51. [51]

    In: European Conference on Computer Vision

    Wu, W., Li, Z., Gu, Y., Zhao, R., He, Y., Zhang, D.J., Shou, M.Z., Li, Y., Gao, T., Zhang, D.: Draganything: Motion control for anything using entity representation. In: European Conference on Computer Vision. pp. 331–348. Springer (2024)

  52. [52]

    In: The Thirteenth International Conference on Learning Representations (2024)

    Xiao, F., Liu, X., Wang, X., Peng, S., Xia, M., Shi, X., Yuan, Z., Wan, P., Zhang, D., Lin, D.: 3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation. In: The Thirteenth International Conference on Learning Representations (2024)

  53. [53]

    Advances in Neural Information Processing Systems 37, 76115–76138 (2024)

    Xiao, Z., Zhou, Y., Yang, S., Pan, X.: Video diffusion models are training-free motion interpreter and controller. Advances in Neural Information Processing Systems 37, 76115–76138 (2024)

  54. [54]

    In: European Conference on Computer Vision

    Xing, J., Xia, M., Zhang, Y., Chen, H., Yu, W., Liu, H., Liu, G., Wang, X., Shan, Y., Wong, T.T.: Dynamicrafter: Animating open-domain images with video diffusion priors. In: European Conference on Computer Vision. pp. 399–417. Springer (2024)

  55. [55]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  56. [56]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Yatim, D., Fridman, R., Bar-Tal, O., Kasten, Y., Dekel, T.: Space-time diffusion features for zero-shot text-driven motion transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8466–8476 (2024)

  57. [57]

    arXiv preprint arXiv:2412.05355 (2024)

    Yesiltepe, H., Meral, T.H.S., Dunlop, C., Yanardag, P.: Motionshop: Zero-shot motion transfer in video diffusion models with mixture of score guidance. arXiv preprint arXiv:2412.05355 (2024)

  58. [58]

    Yin, S., Wu, C., Liang, J., Shi, J., Li, H., Ming, G., Duan, N.: Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089 (2023)

  59. [59]

    In: Proceedings of the 32nd ACM International Conference on Multimedia

    Yu, S., Fang, J.Z., Zheng, J., Sigurdsson, G., Ordonez, V., Piramuthu, R., Bansal, M.: Zero-shot controllable image-to-video animation via motion decomposition. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 3332–3341 (2024)

  60. [60]

    In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3813–3824. IEEE (2023)

  61. [61]

    arXiv preprint arXiv:2501.07563 (2025)

    Zhang, X., Duan, Z., Gong, D., Liu, L.: Training-free motion-guided video generation with enhanced temporal consistency using motion consistency loss. arXiv preprint arXiv:2501.07563 (2025)

  62. [62]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Zhang, Z., Liao, J., Li, M., Dai, Z., Qiu, B., Zhu, S., Qin, L., Wang, W.: Tora: Trajectory-oriented diffusion transformer for video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2063–2073 (2025)

  63. [63]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Zhang, Z., Long, F., Qiu, Z., Pan, Y., Liu, W., Yao, T., Mei, T.: Motionpro: A precise motion controller for image-to-video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27957–27967 (2025)

  64. [64]

    Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Zhang, Y., He, J., Zheng, W.S., Qiao, Y., Liu, Z.: VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)

    Supplementary Material: We recommend readers to watch the accompanying .mp4 videos included in the supplementary material, as st...

  65. [65]

    key-warping

    … and RoPECraft [10]. All other models are implemented using their official public repositories for a fair comparison. All experiments are conducted on a single NVIDIA RTX A6000 GPU with 48 GB memory. Below, we describe the implementation details of Qwerty and the baseline methods. B.1 Our Method: Qwerty is implemented on both the Wan 2.2 TI2V-5B [47] and CogVid...

  66. [66]

    ping-pong

    The input data format was identical to that used for SG-I2V. Videos were generated at a fixed resolution of 576×1024. The number of generated frames matched the input sequence length, but sequences longer than 25 frames were truncated to 25 due to memory constraints. MotionClone. We follow the official MotionClone [25] implementation and only replace its A...