pith. machine review for the scientific record.

arxiv: 2605.14136 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: no theorem link

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 04:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords temporal coherence · video diffusion · self-attention maps · training-free optimization · latent updates · motion consistency · diffusion transformers

The pith

TeDiO improves temporal coherence in video diffusion by smoothing irregular diagonals in self-attention maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video diffusion transformers can generate good-looking frames but often fail at consistent motion across time, resulting in flickering or drifting. The paper identifies that these problems appear as broken or irregular diagonal patterns in the self-attention maps computed inside the model. TeDiO addresses this by measuring the smoothness of those diagonals during generation, locating the problematic areas, and applying small changes to the latent codes to encourage smooth band-diagonal attention. This regularization happens at inference time without any model retraining or external motion data. Tests on models including Wan2.1 and CogVideoX show smoother videos with no loss in individual frame quality.
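
The mechanism is concrete enough to sketch. Below is a minimal, hypothetical illustration (not the authors' released code) of how a DiT block's queries and keys could be pooled into a frame-to-frame attention map and scored for diagonal smoothness; the tensor layout, the spatial pooling, and the roughness score are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def temporal_attention_map(q, k):
    """Frame-to-frame attention from one DiT block's queries and keys.

    q, k: (batch, frames, spatial_tokens, dim). This layout and the spatial
    mean-pooling are assumptions, not the paper's exact reshaping.
    """
    q_t = q.mean(dim=2)                                   # (B, T, D)
    k_t = k.mean(dim=2)                                   # (B, T, D)
    scores = torch.einsum("btd,bsd->bts", q_t, k_t) / q_t.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1)                      # (B, T, T)

def diagonal_roughness(attn, band=1):
    """Lower is smoother: variability of attention values along the
    near-diagonal band of the (T, T) temporal attention map."""
    rough = attn.new_zeros(())
    for off in range(-band, band + 1):
        diag = torch.diagonal(attn, offset=off, dim1=-2, dim2=-1)  # (B, T-|off|)
        rough = rough + diag.std(dim=-1).mean()
    return rough
```

Under this sketch, a coherent clip should give a near-constant band along the diagonal and hence a low roughness score, matching the band-diagonal signature the paper describes.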

Core claim

Incoherent videos exhibit irregular, fragmented temporal diagonals in intermediate self-attention maps, while coherent motion shows smooth band-diagonal patterns. TeDiO reinforces temporal consistency by estimating diagonal smoothness, identifying unstable regions, and performing lightweight latent updates to promote coherent frame-to-frame dynamics.

What carries the argument

TeDiO, a training-free optimization that regularizes temporal diagonals in self-attention maps through targeted latent updates.

If this is right

  • Markedly smoother motion in generated videos across multiple diffusion models.
  • Preservation of per-frame visual quality.
  • Applicable as a plug-and-play addition to existing video generation pipelines.
  • Efficient inference-time improvement without weight modifications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attention map diagnostics could be used to detect other types of generation failures in diffusion models.
  • The approach might extend to improving coherence in other sequential generation tasks such as audio synthesis.
  • Future work could explore whether similar diagonal regularization applies to cross-attention layers.
  • Testing on a wider range of video lengths and motion complexities would clarify the method's limits.

Load-bearing premise

That the primary cause of temporal incoherence is irregular temporal diagonals in self-attention and that lightweight latent updates can reliably smooth them without creating new artifacts.

What would settle it

Generate videos with deliberately introduced temporal artifacts and observe whether TeDiO fails to smooth the attention diagonals or reduces visual quality in those cases.
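
One rough way to set up that stress test, sketched here as an assumption rather than the paper's protocol, is to corrupt the sampling latents with frame-wise noise so the generation is forced into controlled incoherence before TeDiO is applied. The latent layout and perturbation schedule below are illustrative.

```python
import torch

def inject_temporal_jitter(z, strength=0.1, every=2):
    """Add independent noise to every `every`-th frame's latent to create
    deliberate temporal artifacts; z is assumed to be (batch, frames, C, H, W)."""
    z = z.clone()
    noise = torch.randn_like(z[:, ::every])
    z[:, ::every] = z[:, ::every] + strength * noise
    return z
```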

Figures

Figures reproduced from arXiv: 2605.14136 by Gedas Bertasius, Heather Yu, Marc Niethammer, Nurislam Tursynbek, Zhiqiang Lao.

Figure 1: TeDiO, our training-free inference-time method, substantially improves motion coherence in state-of-the-art text-to-video diffusion transformers. Shown are comparisons on Wan2.1 and CogVideoX: without TeDiO, generations suffer from jittery inconsistent motion, merging objects, subject duplication, and physically implausible dynamics. TeDiO resolves these artifacts, producing temporally stable, coherent vi…

Figure 2: Temporal attention maps reflect motion coherence. Incoherent videos often exhibit flickering or abrupt transitions (top row), whereas coherent videos show smooth and stable dynamics (bottom row). Their corresponding temporal attention maps reveal a structural difference: incoherent motion produces irregular, fragmented diagonals, while coherent motion shows clean, band-diagonal patterns that indicate stab…

Figure 3: Overview of TeDiO. During the sampling timestep t, the queries (Q) and the keys (K) from block i of the frozen Diffusion Transformer (DiT) are reshaped to isolate temporal interactions. A lightweight inference-time optimization adjusts the latent z_t using the L_TeDiO loss, encouraging smooth, band-diagonal patterns in temporal attention, indicative of stable frame-to-frame dependencies. This process requ…

Figure 4: TeDiO improves motion coherence, flow consistency, and smooths diagonal spikes, while preserving motion patterns. These results demonstrate enhanced frame-to-frame stability and reduced representational drift; crucially, this is achieved without the need for model fine-tuning or additional supervision. As expected, Dynamic Degree (DD) decreases, reflecting the known trade-off between temporal smoothness and mo…

Figure 5: Across VideoJAM-Bench [11] prompts, TeDiO produces more temporally stable and smoother videos than the base models. (Bar chart of user preferences on Wan2.1 and CogVideoX: Baseline vs. Tie vs. Baseline+TeDiO.)

Figure 6: User preference study on temporal coherence. Participants selected which video had better temporal consistency. User Study. Quantitative metrics offer important diagnostic signals but do not always reflect human perception of temporal smoothness. To complement automatic evaluation, we conducted a perceptual study using VideoJAM-Bench [11]. 10 participants on Prolific [1] viewed paired video generations …
Original abstract

Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper observes that temporal incoherence in video diffusion transformers manifests as irregular, fragmented temporal diagonals in intermediate self-attention maps, while coherent motion produces smooth band-diagonal patterns. It introduces TeDiO, a training-free inference-time method that estimates diagonal smoothness, identifies unstable regions, and applies lightweight latent updates to promote coherent frame-to-frame dynamics without modifying weights or using external supervision. The method is evaluated across models such as Wan2.1 and CogVideoX, claiming markedly smoother motion while preserving per-frame visual quality.

Significance. If the central empirical claim holds after proper validation, TeDiO would offer a lightweight, plug-and-play inference-time intervention that improves temporal coherence in existing pre-trained video diffusion models. This could be practically significant for video generation pipelines where retraining is costly, provided the diagonal-regularization mechanism is shown to be causal rather than incidental.

major comments (3)
  1. [Experiments / Ablation studies] The central claim that irregular temporal diagonals are the primary cause of flickering (and that targeting them via latent updates reliably restores coherence) requires explicit ablation isolating the diagonal smoothness term. Without comparisons to generic latent optimization or alternative attention regularizers, it remains unclear whether the specific diagonal focus is load-bearing or whether any consistency-promoting update would suffice.
  2. [Results / Quantitative evaluation] Quantitative support for 'markedly smoother motion' is not detailed in the provided description. The results section must report specific metrics (e.g., temporal consistency scores, optical-flow variance, user-study percentages) with statistical significance and direct comparisons to baselines and prior training-free methods; absence of these numbers leaves the improvement claim unverified.
  3. [Method / TeDiO formulation] The method description must supply the precise formulation of the diagonal-smoothness estimator and the latent-update objective (including any hyperparameters). Without these equations, it is impossible to assess whether the procedure is truly parameter-free or model-agnostic as asserted.
minor comments (2)
  1. [Method] Clarify the exact definition of 'temporal diagonal' (e.g., which attention heads/layers and how the band width is chosen) to avoid ambiguity in replication.
  2. [Discussion] Add a limitations paragraph discussing potential failure modes, such as degradation on highly dynamic scenes or interaction with classifier-free guidance scales.
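
For the temporal-consistency numbers requested in major comment 2, one common instantiation (in the spirit of VBench-style frame-consistency measures) is the mean cosine similarity between consecutive frames' CLIP image embeddings. The sketch below assumes per-frame features have already been extracted and is illustrative, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def clip_temporal_consistency(frame_features):
    """Mean cosine similarity of consecutive frames' CLIP image features.

    frame_features: (T, D) tensor of per-frame embeddings; higher values
    indicate a more temporally consistent appearance across frames."""
    f = F.normalize(frame_features, dim=-1)
    return (f[:-1] * f[1:]).sum(dim=-1).mean().item()
```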

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where revisions are needed to strengthen the manuscript, we will incorporate them in the next version.

Point-by-point responses
  1. Referee: [Experiments / Ablation studies] The central claim that irregular temporal diagonals are the primary cause of flickering (and that targeting them via latent updates reliably restores coherence) requires explicit ablation isolating the diagonal smoothness term. Without comparisons to generic latent optimization or alternative attention regularizers, it remains unclear whether the specific diagonal focus is load-bearing or whether any consistency-promoting update would suffice.

    Authors: We agree that explicit ablations are necessary to isolate the contribution of the diagonal smoothness term. In the revised manuscript we will add new experiments comparing TeDiO against (i) generic latent-space optimization without the diagonal term and (ii) alternative attention regularizers (e.g., total-variation on attention maps). These results will demonstrate that the temporal-diagonal focus is load-bearing for the observed coherence gains. revision: yes

  2. Referee: [Results / Quantitative evaluation] Quantitative support for 'markedly smoother motion' is not detailed in the provided description. The results section must report specific metrics (e.g., temporal consistency scores, optical-flow variance, user-study percentages) with statistical significance and direct comparisons to baselines and prior training-free methods; absence of these numbers leaves the improvement claim unverified.

    Authors: We will expand the results section with a new table reporting concrete values: temporal consistency scores (CLIP-based and feature-based), optical-flow variance, and user-study preference percentages (with 95% confidence intervals and p-values). Direct comparisons to prior training-free baselines will be included. The full manuscript already contains some of these metrics; the revision will present them more prominently and with statistical tests. revision: yes

  3. Referee: [Method / TeDiO formulation] The method description must supply the precise formulation of the diagonal-smoothness estimator and the latent-update objective (including any hyperparameters). Without these equations, it is impossible to assess whether the procedure is truly parameter-free or model-agnostic as asserted.

    Authors: The original submission contains the estimator (standard deviation of attention values along temporal diagonals) and the update objective (regularized latent optimization), but we acknowledge they were not presented with sufficient formality. In the revision we will add the explicit equations, define all symbols, and list the (few) hyperparameters with their default values, thereby confirming the method remains training-free and model-agnostic. revision: partial
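
The rebuttal's description above (a standard-deviation estimator along temporal diagonals plus regularized latent optimization) is enough to sketch what the inference-time loop plausibly looks like. Everything below, including the block hook, step count, and learning rate, is an assumption layered on the earlier attention-map snippet, not the released method.

```python
import torch

def tedio_style_update(z_t, qk_from_block, n_steps=3, lr=0.05, band=1):
    """Refine the latent z_t so one frozen DiT block's temporal attention
    becomes smoother along its diagonals; model weights are never updated.

    qk_from_block: hypothetical hook mapping a latent to that block's (Q, K)."""
    z = z_t.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(n_steps):
        q, k = qk_from_block(z)
        attn = temporal_attention_map(q, k)          # from the earlier sketch
        loss = diagonal_roughness(attn, band=band)   # std along temporal diagonals
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```

A handful of such steps per sampling timestep would keep the intervention lightweight, consistent with the training-free, plug-and-play framing.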

Circularity Check

0 steps flagged

No circularity; empirical observation followed by plug-and-play regularization

full rationale

The paper's central claim rests on an observed correlation between irregular temporal diagonals in self-attention maps and video incoherence, followed by a training-free latent-update method that promotes diagonal smoothness. No equations, parameter fitting, self-citations, or uniqueness theorems are provided in the supplied text that would reduce the method or its performance claims to the inputs by construction. The derivation chain is therefore self-contained as an empirical intervention rather than a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on an empirical correlation between attention-map patterns and video coherence; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5471 in / 1081 out tokens · 34166 ms · 2026-05-15T04:56:27.059297+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 13 internal anchors

[1] Prolific. https://www.prolific.com/.
[2] Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, and Daniel Cohen-Or. Cross-image attention for zero-shot appearance transfer. In ACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024.
[3] Jianhong Bai, Tianyu He, Yuchi Wang, Junliang Guo, Haoji Hu, Zuozhu Liu, and Jiang Bian. Uniedit: A unified tuning-free framework for video motion and appearance editing. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 10171–10180, 2025.
[4] Zhipeng Bao, Yijun Li, Krishna Kumar Singh, Yu-Xiong Wang, and Martial Hebert. Separate-and-enhance: Compositional finetuning for text-to-image diffusion models. In ACM SIGGRAPH 2024 Conference Papers, pages 1–10, 2024.
[5] Chongke Bi, Xin Gao, Jiangkang Deng, et al. Cd-tvd: Contrastive diffusion for 3d super-resolution with scarce high-resolution time-varying data. arXiv preprint arXiv:2508.08173, 2025.
[6] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[7] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.
[8] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22560–22570, 2023.
[9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
[10] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
[11] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models. In ICML, 2025.
[12] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. Flatten: Optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922, 2023.
[13] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[14] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
[15] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.
[16] Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Yusuf Aytar, Michael Rubinstein, Chen Sun, et al. Motion prompting: Controlling video generation with motion trajectories. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1–12, 2025.
[17] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
[18] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[19] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
[20] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[21] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
[22] Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, et al. Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv preprint arXiv:2402.03161, 2024.
[23] Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024.
[24] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.
[25] KlingAI. Kling AI: Next-Generation AI Creative Studio. https://klingai.com, 2024.
[26] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[27] Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468, 2024.
[28] LAION-AI. LAION aesthetic-predictor. https://github.com/LAION-AI/aesthetic-predictor, 2022.
[29] Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng. Amt: All-pairs multi-field transforms for efficient frame interpolation. In CVPR, pages 9801–9810, 2023.
[30] Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, and Yi Jin. Motionclone: Training-free motion cloning for controllable video generation. arXiv preprint arXiv:2406.05338, 2024.
[31] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[32] Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation. In European Conference on Computer Vision, pages 360–378. Springer, 2024.
[33] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8599–8608, 2024.
[34] Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732, 2025.
[35] Wan-Duo Kurt Ma, John P Lewis, and W Bastiaan Kleijn. Trailblazer: Trajectory control for diffusion-based video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.
[36] Hyelin Nam, Jaemin Kim, Dohun Lee, and Jong Chul Ye. Optical-flow guided prompt optimization for coherent video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7837–7846, 2025.
[37] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.
[38] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
[39] Alexander Pondaven, Aliaksandr Siarohin, Sergey Tulyakov, Philip Torr, and Fabio Pizzati. Video motion transfer with diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22911–22921, 2025.
[40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
[41] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[42] Ariel Shaulov, Itay Hazan, Lior Wolf, and Hila Chefer. Flowmo: Variance-based flow guidance for coherent motion in video generation, 2025.
[43] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
[44] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[45] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, pages 402–419. Springer, 2020.
[46] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526–1535, 2018.
[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[48] Veo 3. Veo 3: Higher-quality video generation with audio and speech. Google Cloud Blog, 2025. Announced May 21, 2025.
[49] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.
[50] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
[51] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan, et al. Wan: Open and advanced large-scale video generative models.
[52] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
[53] Xun Wu, Shaohan Huang, Guolong Wang, Jing Xiong, and Furu Wei. Boosting text-to-video generative model with MLLMs feedback. Advances in Neural Information Processing Systems, 37:139444–139469, 2024.
[54] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
[55] Ke Zhang, Cihan Xiao, Yiqun Mei, Jiacong Xu, and Vishal M Patel. Think before you diffuse: LLMs-guided physics-aware video generation. arXiv preprint arXiv:2505.21653, 2025.
[56] Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, and Li Yuan. Holotime: Taming video diffusion models for panoramic 4d scene generation. arXiv preprint arXiv:2504.21650, 2025.