pith. sign in

arxiv: 2605.19398 · v2 · pith:UNOHQWOFnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

Pith reviewed 2026-05-21 07:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image-to-video generationmotion enhancementattention rebalancingreference frame dominancedenoising processtraining-free methodvideo dynamics
0
0 comments X

The pith

Non-reference frames over-attend to the reference frame in image-to-video models, suppressing natural motion across time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that image-to-video models stay overly static because generated frames allocate too much self-attention to tokens from the single reference frame. This over-propagates the reference image's information through the sequence and crowds out changes between consecutive frames. The authors introduce DyMoS, a training-free adjustment that reduces this attention pathway only during the first few denoising steps. A single scalar then lets users dial motion strength up or down while the original model weights and input image stay untouched. Tests on several current I2V backbones show higher motion measures without loss of visual quality or reference fidelity.

Core claim

Reference-frame dominance arises when non-reference frames give excessive self-attention weight to reference-frame key tokens. This causes reference information to spread too strongly across time steps and damps inter-frame dynamics. DyMoS counters the effect by rebalancing the attention scores from generated frames back toward their own content during the initial denoising steps, using a single tunable scalar to control the strength of the correction.

What carries the argument

DyMoS, a scalar-controlled rebalancing of attention weights from generated frames to the reference frame applied only in early denoising steps.

If this is right

  • Motion strength becomes continuously adjustable in existing I2V models without retraining or changing the input image.
  • The same attention rebalancing can be applied to any current or future I2V backbone that uses similar frame-wise self-attention.
  • Visual fidelity to the reference image remains intact because only early denoising attention paths are modified.
  • The method requires no extra parameters beyond the single motion slider.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-step attention rebalancing might reduce dominance effects in other conditioned video or 3-D generation settings where one conditioning signal is disproportionately strong.
  • Extending the rebalancing window or making it content-adaptive could further tune motion for specific scene types.
  • Combining DyMoS with existing motion-regularization losses might produce additive gains in long-sequence coherence.

Load-bearing premise

Selectively lowering attention from generated frames to the reference frame only in the first denoising steps will increase inter-frame motion without creating new visual artifacts or lowering fidelity to the input image.

What would settle it

Apply DyMoS to a standard I2V model on a test set of reference images with known ground-truth motion; if measured optical-flow magnitude or frame-to-frame difference does not rise while perceptual quality and image similarity scores stay the same or drop, the mechanism is falsified.

Figures

Figures reproduced from arXiv: 2605.19398 by Hae-Gon Jeon, Hyeonho Jeong, Sangeyl Lee, Seungho Park, Seunghyun Shin, Wooseok Jeon.

Figure 1
Figure 1. Figure 1: Example videos from our method. We present DyMoS, a training-free and model-agnostic method for improving motion dynamics in image-to-video generation. (a) Comparison of generated videos from the same input image. DyMoS produces dynamic motion while preserving video quality. (b) Furthermore, DyMoS provides continuous control over motion dynamics. Abstract Image-to-video models often generate videos that re… view at source ↗
Figure 2
Figure 2. Figure 2: Reference-frame dominance in I2V self-attention. (a) Qualitative comparison between paired T2V and I2V generations. (b) Frame-to-frame self-attention difference map AI2V − AT2V, averaged over the first 10% of inference steps. images for an I2V model, using the same text prompts sourced from T2V-CompBench [32]. Since this setup ensures the same prompt and first frame for both models, we can analyze the diff… view at source ↗
Figure 3
Figure 3. Figure 3: Modulating reference-frame dominance controls motion dynamics. (a) Absolute difference in reference-frame attention between the vanilla I2V model and the modulated I2V model with γ = 0.6, measured over non-reference query frames. (b) Dynamic Degree and Video Quality as γ varies. The yellow star denotes the Dynamic Degree of the paired T2V generation. (c) T2V–I2V attention distance D(γ) measured by Jensen–S… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison with vanilla I2V baseline and ALG. The leftmost images are the reference images. Our method (DyMoS) produces substantially more dynamic motion than the vanilla baseline and ALG while preserving fidelity to the reference image. In contrast, ALG introduces motion at the cost of visible degradation. steers the guided update away from reference-frame dominance while keeping the null-text… view at source ↗
Figure 5
Figure 5. Figure 5: Hyperparameter analysis and user study. (a) Effect of modulation strength γ. (b) Effect of modulation step ratio λ. (c) User study results. This indicates that our attention-level intervention improves motion dynamics more effectively than input-level attenuation, while preserving visual quality. Our method also achieves the highest ViCLIP scores among the compared methods, suggesting that reducing referen… view at source ↗
Figure 6
Figure 6. Figure 6: Continuous control over static-to-dynamic generation with DyMoS. Rows from top to bottom correspond to γ ∈ {−2, −1, 0, 0.6, 1.0}, with γ = 0 denoting the baseline. DyMoS to an appropriate number of initial denoising steps is sufficient to enhance motion, while switching back to the original attention computation helps maintain generation quality. 4.4 Application: Continuous control over motion dynamics A k… view at source ↗
Figure 7
Figure 7. Figure 7: Additional comparison results with the vanilla baseline and ALG. The leftmost images are the reference images. Our method qualitatively outperforms the vanilla baseline and ALG across various cases, demonstrating superior motion dynamics and visual fidelity. (Top) The vanilla baseline and ALG exhibit static scenes of a man riding a mountain bike. In contrast, our method generates fluid and natural riding m… view at source ↗
Figure 8
Figure 8. Figure 8: Additional comparison results with the vanilla baseline and ALG. The leftmost images are the reference images. (Top) ALG produces highly static scenes where the crab barely moves. In contrast, our method successfully synthesizes vivid and realistic motions. (Bottom) The vanilla baseline produces physically weird motions, such as the bird flying backwards. While ALG generates physically plausible movements,… view at source ↗
Figure 9
Figure 9. Figure 9: Additional comparison results with the vanilla baseline and ALG. The leftmost images are the reference images. (Top) Both the vanilla baseline and ALG struggle to synthesize dynamic motions, resulting in rigid and unnatural movements of the man. In contrast, our method generates natural motions of the man hoisting a spear. (Middle) Unlike the other methods, DyMoS successfully generates a semantically align… view at source ↗
Figure 10
Figure 10. Figure 10: Additional examples of continuous control over motion dynamics with DyMoS. Rows from top to bottom correspond to γ ∈ {−2, −1, 0, 0.6, 1.0}, with γ = 0 denoting the baseline. (Top Left) The movement of the person riding a horse transitions naturally across the static-to-dynamic spectrum as the parameter increases. (Top Right) Our guidance smoothly scales the tiger’s walking speed from slow to fast. Similar… view at source ↗
read the original abstract

Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper observes that non-reference frames in image-to-video (I2V) diffusion models allocate excessive self-attention to reference-frame key tokens, over-propagating static reference information and suppressing inter-frame dynamics. It proposes DyMoS (Dynamic Motion Slider), a training-free, model-agnostic intervention that rebalances attention from generated frames to the reference frame only during initial denoising steps, controlled by a single scalar motion-strength parameter. The method leaves the input image and model weights unchanged and is claimed to improve motion while preserving visual quality and reference fidelity across multiple SOTA I2V backbones.

Significance. If the empirical link between the targeted early-step attention rebalancing and improved motion holds under quantitative scrutiny, the work would offer a lightweight, plug-and-play solution for a common limitation in I2V models. The training-free and model-agnostic design, together with continuous scalar control, are clear strengths that distinguish it from retraining-heavy alternatives and could enable rapid integration into existing pipelines.

major comments (1)
  1. [Method / DyMoS description] The restriction of rebalancing to initial denoising steps is load-bearing for the central claim, because the paper argues that coarse motion structure forms early while later steps refine temporal consistency. No ablation varying the step window or comparing against full-trajectory rebalancing is reported, leaving open whether observed motion gains arise specifically from this timing choice or from a general, temporary weakening of the reference signal.
minor comments (1)
  1. [Abstract] The abstract asserts consistent improvements across backbones but provides no quantitative metrics (e.g., motion scores, FID, or user-study results) or baseline comparisons, which would be needed to substantiate the claims.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of our training-free, model-agnostic approach. We address the major comment below and have revised the manuscript to incorporate additional experiments that directly respond to the concern.

read point-by-point responses
  1. Referee: The restriction of rebalancing to initial denoising steps is load-bearing for the central claim, because the paper argues that coarse motion structure forms early while later steps refine temporal consistency. No ablation varying the step window or comparing against full-trajectory rebalancing is reported, leaving open whether observed motion gains arise specifically from this timing choice or from a general, temporary weakening of the reference signal.

    Authors: We agree that an explicit ablation on the timing window would strengthen the central claim. Our design choice is grounded in the well-established property of diffusion models that early denoising steps determine coarse structure (including motion layout) while later steps primarily refine appearance and temporal consistency; this is why we restrict rebalancing to the initial phase. To address the referee's point directly, we have added a new ablation study in the revised manuscript (Section 4.3 and Appendix C) that varies the rebalancing window (steps 1-10, 1-20, 1-30, and full trajectory) and compares against a constant full-trajectory baseline. The results confirm that restricting intervention to the earliest steps yields the largest motion gains with negligible fidelity loss, whereas extending rebalancing across the full trajectory degrades reference fidelity (as measured by CLIP similarity and perceptual metrics). We have also updated the method description to include this empirical justification and the corresponding attention-map visualizations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation directly motivates training-free intervention

full rationale

The paper's central claim rests on an empirical attention observation (non-reference frames over-attend to reference keys) followed by a direct, training-free rebalancing method (DyMoS) applied selectively in early denoising steps. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the core mechanism. The method is explicitly model-agnostic and leaves weights and input image unchanged, making the derivation self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that reference-frame attention dominance is the primary cause of motion suppression and on the introduction of one tunable scalar whose effect is validated only at the level of the abstract.

free parameters (1)
  • motion strength scalar
    Single scalar parameter introduced to provide continuous control over motion strength during the rebalancing operation.
axioms (1)
  • domain assumption Reference-frame dominance via excessive self-attention to reference key tokens is the key mechanism suppressing inter-frame dynamics in I2V models
    The paper states this as the identified cause of the static-video problem.

pith-pipeline@v0.9.0 · 5731 in / 1232 out tokens · 41509 ms · 2026-05-21T07:51:54.237308+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 11 internal anchors

  1. [1]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  2. [2]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview. net/forum?id=PxTIG12RRHS

  3. [3]

    Deep unsuper- vised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015

  4. [4]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  5. [5]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  6. [6]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=PqvMRDCJT9t

  7. [7]

    Building normalizing flows with stochastic interpolants

    Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=li7qeBbCR1t

  8. [8]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  9. [9]

    HunyuanVideo 1.5 Technical Report

    Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025. URL https://arxiv.org/abs/2511.18870

  10. [10]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  11. [11]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan.Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. InThe Thirteenth International Conference on Learning Represen...

  12. [12]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  13. [13]

    Video diffusion models.Advances in neural information processing systems, 35: 8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35: 8633–8646, 2022

  14. [14]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131, 2024

  15. [15]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022. 10

  16. [16]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=Fx2SbBgcte

  17. [17]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  18. [18]

    Conditional image-to-video generation with latent flow diffusion models

    Haomiao Ni, Changhao Shi, Kai Li, Sharon X Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18444–18455. IEEE, 2023

  19. [19]

    Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling

    Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  20. [20]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

  21. [21]

    Dynamicrafter: Animating open-domain images with video diffusion priors

    Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Dynamicrafter: Animating open-domain images with video diffusion priors. InEuropean Conference on Computer Vision, pages 399–417. Springer, 2024

  22. [22]

    I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

    Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models.arXiv preprint arXiv:2311.04145, 2023

  23. [23]

    Identifying and solving conditional image leakage in image-to-video diffusion model

    Min Zhao, Hongzhou Zhu, Chendong Xiang, Kaiwen Zheng, Chongxuan Li, and Jun Zhu. Identifying and solving conditional image leakage in image-to-video diffusion model. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=o9Lkiv1qpc

  24. [24]

    Flashi2v: Fourier-guided latent shifting prevents conditional image leakage in image-to-video generation.arXiv preprint arXiv:2509.25187, 2025

    Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, and Li Yuan. Flashi2v: Fourier-guided latent shifting prevents conditional image leakage in image-to-video generation.arXiv preprint arXiv:2509.25187, 2025

  25. [25]

    Enhancing motion dynamics of image-to-video models via adaptive low-pass guidance.arXiv preprint arXiv:2506.08456, 2025

    June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, and Kimin Lee. Enhancing motion dynamics of image-to-video models via adaptive low-pass guidance.arXiv preprint arXiv:2506.08456, 2025

  26. [26]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  27. [27]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024

  28. [28]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

  29. [29]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

  30. [30]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  31. [31]

    History-guided video diffusion

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. InForty-second International Conference on Machine Learning,

  32. [32]

    URLhttps://openreview.net/forum?id=j8Vr3E3vhy. 11

  33. [33]

    T2v- compbench: A comprehensive benchmark for compositional text-to-video generation

    Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v- compbench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025

  34. [34]

    Motioncfg: Boosting motion dynamics via stochastic concept perturbation.arXiv preprint arXiv:2603.14073, 2026

    Byungjun Kim, Soobin Um, and Jong Chul Ye. Motioncfg: Boosting motion dynamics via stochastic concept perturbation.arXiv preprint arXiv:2603.14073, 2026

  35. [35]

    Motion prior distillation in time reversal sampling for generative inbetweening

    Wooseok Jeon, Seunghyun Shin, Dongmin Shin, and Hae-Gon Jeon. Motion prior distillation in time reversal sampling for generative inbetweening. InThe Fourteenth International Con- ference on Learning Representations, 2026. URL https://openreview.net/forum?id= GRElsj9W2t

  36. [36]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  37. [37]

    Divergence measures based on the shannon entropy.IEEE Transactions on Information theory, 37(1):145–151, 2002

    Jianhua Lin. Divergence measures based on the shannon entropy.IEEE Transactions on Information theory, 37(1):145–151, 2002

  38. [38]

    Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

    Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024

  39. [39]

    Internvid: A large-scale video-text dataset for multimodal understanding and generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. Internvid: A large-scale video-text dataset for multimodal understanding and generation. In The Twelfth International Conference on Learning Representations, 2024. URL https:// openreview.n...

  40. [40]

    Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation

    Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11269–11277, 2026

  41. [41]

    Perception encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Abdul Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Shang-Wen Li, Piotr Dollar, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. In The...

  42. [42]

    Vidprom: A million-scale real prompt-gallery dataset for text-to- video diffusion models.Advances in Neural Information Processing Systems, 37:65618–65642, 2024

    Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to- video diffusion models.Advances in Neural Information Processing Systems, 37:65618–65642, 2024

  43. [43]

    Flux.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

  44. [44]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pages 402–419. Springer, 2020

  45. [45]

    Amazon mechanical turk: A research tool for organizations and information systems scholars

    Kevin Crowston. Amazon mechanical turk: A research tool for organizations and information systems scholars. InShaping the Future of ICT Research. Methods and Approaches: IFIP WG 8.2, Working Conference, Tampa, FL, USA, December 13-14, 2012. Proceedings, 2012. 12 A Full algorithm of DyMoS Algorithm 1DyMoS Input: Reference image Iref, text prompt c, total i...

  46. [46]

    Motion:Which video has the most dynamic and realistic motion? Examples include water ripples, cloth movement, human action, and camera motion

  47. [47]

    Fidelity:Which video best preserves the appearance of the reference image throughout the sequence? Examples include the subject, background, and colors. 3.Text alignment:Which video most faithfully reflects the content described in the text prompt? 4.Overall preference:Overall, which video do you prefer? We collect 30 responses for each question over 25 r...