pith. machine review for the scientific record.

arxiv: 2605.14136 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: no theorem link

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 04:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords temporal coherence · video diffusion · self-attention maps · training-free optimization · latent updates · motion consistency · diffusion transformers

The pith

TeDiO improves temporal coherence in video diffusion by smoothing irregular diagonals in self-attention maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video diffusion transformers can generate good-looking frames but often fail at consistent motion across time, resulting in flickering or drifting. The paper identifies that these problems appear as broken or irregular diagonal patterns in the self-attention maps computed inside the model. TeDiO addresses this by measuring the smoothness of those diagonals during generation, locating the problematic areas, and applying small changes to the latent codes to encourage smooth band-diagonal attention. This regularization happens at inference time without any model retraining or external motion data. Tests on models including Wan2.1 and CogVideoX show smoother videos with no loss in individual frame quality.
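
The mechanism is concrete enough to sketch. Below is a minimal, hypothetical illustration (not the authors' released code) of how a DiT block's queries and keys could be pooled into a frame-to-frame attention map and scored for diagonal smoothness; the tensor layout, the spatial pooling, and the roughness score are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def temporal_attention_map(q, k):
    """Frame-to-frame attention from one DiT block's queries and keys.

    q, k: (batch, frames, spatial_tokens, dim). This layout and the spatial
    mean-pooling are assumptions, not the paper's exact reshaping.
    """
    q_t = q.mean(dim=2)                                   # (B, T, D)
    k_t = k.mean(dim=2)                                   # (B, T, D)
    scores = torch.einsum("btd,bsd->bts", q_t, k_t) / q_t.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1)                      # (B, T, T)

def diagonal_roughness(attn, band=1):
    """Lower is smoother: variability of attention values along the
    near-diagonal band of the (T, T) temporal attention map."""
    rough = attn.new_zeros(())
    for off in range(-band, band + 1):
        diag = torch.diagonal(attn, offset=off, dim1=-2, dim2=-1)  # (B, T-|off|)
        rough = rough + diag.std(dim=-1).mean()
    return rough
```

Under this sketch, a coherent clip should give a near-constant band along the diagonal and hence a low roughness score, matching the band-diagonal signature the paper describes.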

Core claim

Incoherent videos exhibit irregular, fragmented temporal diagonals in intermediate self-attention maps, while coherent motion shows smooth band-diagonal patterns. TeDiO reinforces temporal consistency by estimating diagonal smoothness, identifying unstable regions, and performing lightweight latent updates to promote coherent frame-to-frame dynamics.

What carries the argument

TeDiO, a training-free optimization that regularizes temporal diagonals in self-attention maps through targeted latent updates.

If this is right

  • Markedly smoother motion in generated videos across multiple diffusion models.
  • Preservation of per-frame visual quality.
  • Applicable as a plug-and-play addition to existing video generation pipelines.
  • Efficient inference-time improvement without weight modifications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attention map diagnostics could be used to detect other types of generation failures in diffusion models.
  • The approach might extend to improving coherence in other sequential generation tasks such as audio synthesis.
  • Future work could explore whether similar diagonal regularization applies to cross-attention layers.
  • Testing on a wider range of video lengths and motion complexities would clarify the method's limits.

Load-bearing premise

That the primary cause of temporal incoherence is irregular temporal diagonals in self-attention and that lightweight latent updates can reliably smooth them without creating new artifacts.

What would settle it

Generate videos with deliberately introduced temporal artifacts and observe whether TeDiO fails to smooth the attention diagonals or reduces visual quality in those cases.
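
One rough way to set up that stress test, sketched here as an assumption rather than the paper's protocol, is to corrupt the sampling latents with frame-wise noise so the generation is forced into controlled incoherence before TeDiO is applied. The latent layout and perturbation schedule below are illustrative.

```python
import torch

def inject_temporal_jitter(z, strength=0.1, every=2):
    """Add independent noise to every `every`-th frame's latent to create
    deliberate temporal artifacts; z is assumed to be (batch, frames, C, H, W)."""
    z = z.clone()
    noise = torch.randn_like(z[:, ::every])
    z[:, ::every] = z[:, ::every] + strength * noise
    return z
```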

Figures

Figures reproduced from arXiv: 2605.14136 by Gedas Bertasius, Heather Yu, Marc Niethammer, Nurislam Tursynbek, Zhiqiang Lao.

Figure 1: TeDiO, our training-free inference-time method, substantially improves motion coherence in state-of-the-art text-to-video diffusion transformers. Shown are comparisons on Wan2.1 and CogVideoX: without TeDiO, generations suffer from jittery inconsistent motion, merging objects, subject duplication, and physically implausible dynamics. TeDiO resolves these artifacts, producing temporally stable, coherent vi…

Figure 2: Temporal attention maps reflect motion coherence. Incoherent videos often exhibit flickering or abrupt transitions (top row), whereas coherent videos show smooth and stable dynamics (bottom row). Their corresponding temporal attention maps reveal a structural difference: incoherent motion produces irregular, fragmented diagonals, while coherent motion shows clean, band-diagonal patterns that indicate stab…

Figure 3: Overview of TeDiO. During the sampling timestep t, the queries (Q) and the keys (K) from block i of the frozen Diffusion Transformer (DiT) are reshaped to isolate temporal interactions. A lightweight inference-time optimization adjusts the latent z_t using the L_TeDiO loss, encouraging smooth, band-diagonal patterns in temporal attention, indicative of stable frame-to-frame dependencies. This process requ…

Figure 4: TeDiO improves motion coherence, flow consistency, and smooths diagonal spikes, while preserving motion patterns. These results demonstrate enhanced frame-to-frame stability and reduced representational drift; crucially, this is achieved without the need for model fine-tuning or additional supervision. As expected, Dynamic Degree (DD) decreases, reflecting the known trade-off between temporal smoothness and mo…

Figure 5: Across VideoJAM-Bench [11] prompts, TeDiO produces more temporally stable and smoother videos than the base models. (Bar chart of user preferences on Wan2.1 and CogVideoX: Baseline vs. Tie vs. Baseline+TeDiO.)

Figure 6: User preference study on temporal coherence. Participants selected which video had better temporal consistency. User Study. Quantitative metrics offer important diagnostic signals but do not always reflect human perception of temporal smoothness. To complement automatic evaluation, we conducted a perceptual study using VideoJAM-Bench [11]. 10 participants on Prolific [1] viewed paired video generations …
Original abstract

Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper observes that temporal incoherence in video diffusion transformers manifests as irregular, fragmented temporal diagonals in intermediate self-attention maps, while coherent motion produces smooth band-diagonal patterns. It introduces TeDiO, a training-free inference-time method that estimates diagonal smoothness, identifies unstable regions, and applies lightweight latent updates to promote coherent frame-to-frame dynamics without modifying weights or using external supervision. The method is evaluated across models such as Wan2.1 and CogVideoX, claiming markedly smoother motion while preserving per-frame visual quality.

Significance. If the central empirical claim holds after proper validation, TeDiO would offer a lightweight, plug-and-play inference-time intervention that improves temporal coherence in existing pre-trained video diffusion models. This could be practically significant for video generation pipelines where retraining is costly, provided the diagonal-regularization mechanism is shown to be causal rather than incidental.

major comments (3)
  1. [Experiments / Ablation studies] The central claim that irregular temporal diagonals are the primary cause of flickering (and that targeting them via latent updates reliably restores coherence) requires explicit ablation isolating the diagonal smoothness term. Without comparisons to generic latent optimization or alternative attention regularizers, it remains unclear whether the specific diagonal focus is load-bearing or whether any consistency-promoting update would suffice.
  2. [Results / Quantitative evaluation] Quantitative support for 'markedly smoother motion' is not detailed in the provided description. The results section must report specific metrics (e.g., temporal consistency scores, optical-flow variance, user-study percentages) with statistical significance and direct comparisons to baselines and prior training-free methods; absence of these numbers leaves the improvement claim unverified.
  3. [Method / TeDiO formulation] The method description must supply the precise formulation of the diagonal-smoothness estimator and the latent-update objective (including any hyperparameters). Without these equations, it is impossible to assess whether the procedure is truly parameter-free or model-agnostic as asserted.
minor comments (2)
  1. [Method] Clarify the exact definition of 'temporal diagonal' (e.g., which attention heads/layers and how the band width is chosen) to avoid ambiguity in replication.
  2. [Discussion] Add a limitations paragraph discussing potential failure modes, such as degradation on highly dynamic scenes or interaction with classifier-free guidance scales.
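
For the temporal-consistency numbers requested in major comment 2, one common instantiation (in the spirit of VBench-style frame-consistency measures) is the mean cosine similarity between consecutive frames' CLIP image embeddings. The sketch below assumes per-frame features have already been extracted and is illustrative, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def clip_temporal_consistency(frame_features):
    """Mean cosine similarity of consecutive frames' CLIP image features.

    frame_features: (T, D) tensor of per-frame embeddings; higher values
    indicate a more temporally consistent appearance across frames."""
    f = F.normalize(frame_features, dim=-1)
    return (f[:-1] * f[1:]).sum(dim=-1).mean().item()
```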

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where revisions are needed to strengthen the manuscript, we will incorporate them in the next version.

Point-by-point responses
  1. Referee: [Experiments / Ablation studies] The central claim that irregular temporal diagonals are the primary cause of flickering (and that targeting them via latent updates reliably restores coherence) requires explicit ablation isolating the diagonal smoothness term. Without comparisons to generic latent optimization or alternative attention regularizers, it remains unclear whether the specific diagonal focus is load-bearing or whether any consistency-promoting update would suffice.

    Authors: We agree that explicit ablations are necessary to isolate the contribution of the diagonal smoothness term. In the revised manuscript we will add new experiments comparing TeDiO against (i) generic latent-space optimization without the diagonal term and (ii) alternative attention regularizers (e.g., total-variation on attention maps). These results will demonstrate that the temporal-diagonal focus is load-bearing for the observed coherence gains. revision: yes

  2. Referee: [Results / Quantitative evaluation] Quantitative support for 'markedly smoother motion' is not detailed in the provided description. The results section must report specific metrics (e.g., temporal consistency scores, optical-flow variance, user-study percentages) with statistical significance and direct comparisons to baselines and prior training-free methods; absence of these numbers leaves the improvement claim unverified.

    Authors: We will expand the results section with a new table reporting concrete values: temporal consistency scores (CLIP-based and feature-based), optical-flow variance, and user-study preference percentages (with 95% confidence intervals and p-values). Direct comparisons to prior training-free baselines will be included. The full manuscript already contains some of these metrics; the revision will present them more prominently and with statistical tests. revision: yes

  3. Referee: [Method / TeDiO formulation] The method description must supply the precise formulation of the diagonal-smoothness estimator and the latent-update objective (including any hyperparameters). Without these equations, it is impossible to assess whether the procedure is truly parameter-free or model-agnostic as asserted.

    Authors: The original submission contains the estimator (standard deviation of attention values along temporal diagonals) and the update objective (regularized latent optimization), but we acknowledge they were not presented with sufficient formality. In the revision we will add the explicit equations, define all symbols, and list the (few) hyperparameters with their default values, thereby confirming the method remains training-free and model-agnostic. revision: partial
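
The rebuttal's description above (a standard-deviation estimator along temporal diagonals plus regularized latent optimization) is enough to sketch what the inference-time loop plausibly looks like. Everything below, including the block hook, step count, and learning rate, is an assumption layered on the earlier attention-map snippet, not the released method.

```python
import torch

def tedio_style_update(z_t, qk_from_block, n_steps=3, lr=0.05, band=1):
    """Refine the latent z_t so one frozen DiT block's temporal attention
    becomes smoother along its diagonals; model weights are never updated.

    qk_from_block: hypothetical hook mapping a latent to that block's (Q, K)."""
    z = z_t.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(n_steps):
        q, k = qk_from_block(z)
        attn = temporal_attention_map(q, k)          # from the earlier sketch
        loss = diagonal_roughness(attn, band=band)   # std along temporal diagonals
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```

A handful of such steps per sampling timestep would keep the intervention lightweight, consistent with the training-free, plug-and-play framing.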

Circularity Check

0 steps flagged

No circularity; empirical observation followed by plug-and-play regularization

full rationale

The paper's central claim rests on an observed correlation between irregular temporal diagonals in self-attention maps and video incoherence, followed by a training-free latent-update method that promotes diagonal smoothness. No equations, parameter fitting, self-citations, or uniqueness theorems are provided in the supplied text that would reduce the method or its performance claims to the inputs by construction. The derivation chain is therefore self-contained as an empirical intervention rather than a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on an empirical correlation between attention-map patterns and video coherence; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5471 in / 1081 out tokens · 34166 ms · 2026-05-15T04:56:27.059297+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 13 internal anchors

[1] Prolific. https://www.prolific.com/.
[2] Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, and Daniel Cohen-Or. Cross-image attention for zero-shot appearance transfer. In ACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024.
[3] Jianhong Bai, Tianyu He, Yuchi Wang, Junliang Guo, Haoji Hu, Zuozhu Liu, and Jiang Bian. Uniedit: A unified tuning-free framework for video motion and appearance editing. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 10171–10180, 2025.
[4] Zhipeng Bao, Yijun Li, Krishna Kumar Singh, Yu-Xiong Wang, and Martial Hebert. Separate-and-enhance: Compositional finetuning for text-to-image diffusion models. In ACM SIGGRAPH 2024 Conference Papers, pages 1–10, 2024.
[5] Chongke Bi, Xin Gao, Jiangkang Deng, et al. Cd-tvd: Contrastive diffusion for 3d super-resolution with scarce high-resolution time-varying data. arXiv preprint arXiv:2508.08173, 2025.
[6] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[7] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.
[8] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22560–22570, 2023.
[9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
[10] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
[11] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models. In ICML, 2025.
[12] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. Flatten: Optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922, 2023.
[13] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[14] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
[15] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.
[16] Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Yusuf Aytar, Michael Rubinstein, Chen Sun, et al. Motion prompting: Controlling video generation with motion trajectories. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1–12, 2025.
[17] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
[18] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[19] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
[20] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[21] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
[22] Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, et al. Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv preprint arXiv:2402.03161, 2024.
[23] Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024.
[24] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.
[25] KlingAI. Kling AI: Next-Generation AI Creative Studio. https://klingai.com, 2024.
[26] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[27] Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468, 2024.
[28] LAION-AI. LAION aesthetic-predictor. https://github.com/LAION-AI/aesthetic-predictor, 2022.
[29] Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng. Amt: All-pairs multi-field transforms for efficient frame interpolation. In CVPR, pages 9801–9810, 2023.
[30] Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, and Yi Jin. Motionclone: Training-free motion cloning for controllable video generation. arXiv preprint arXiv:2406.05338, 2024.
[31] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[32] Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation. In European Conference on Computer Vision, pages 360–378. Springer, 2024.
[33] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8599–8608, 2024.
[34] Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732, 2025.
[35] Wan-Duo Kurt Ma, John P Lewis, and W Bastiaan Kleijn. Trailblazer: Trajectory control for diffusion-based video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.
[36] Hyelin Nam, Jaemin Kim, Dohun Lee, and Jong Chul Ye. Optical-flow guided prompt optimization for coherent video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7837–7846, 2025.
[37] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.
[38] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
[39] Alexander Pondaven, Aliaksandr Siarohin, Sergey Tulyakov, Philip Torr, and Fabio Pizzati. Video motion transfer with diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22911–22921, 2025.
[40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
[41] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[42] Ariel Shaulov, Itay Hazan, Lior Wolf, and Hila Chefer. Flowmo: Variance-based flow guidance for coherent motion in video generation, 2025.
[43] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
[44] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[45] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, pages 402–419. Springer, 2020.
[46] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526–1535, 2018.
[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[48] Veo 3. Veo 3: Higher-quality video generation with audio and speech. Google Cloud Blog, 2025. Announced May 21, 2025.
[49] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.
[50] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
[51] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan, et al. Wan: Open and advanced large-scale video generative models.
[52] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
[53] Xun Wu, Shaohan Huang, Guolong Wang, Jing Xiong, and Furu Wei. Boosting text-to-video generative model with MLLMs feedback. Advances in Neural Information Processing Systems, 37:139444–139469, 2024.
[54] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
[55] Ke Zhang, Cihan Xiao, Yiqun Mei, Jiacong Xu, and Vishal M Patel. Think before you diffuse: LLMs-guided physics-aware video generation. arXiv preprint arXiv:2505.21653, 2025.
[56] Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, and Li Yuan. Holotime: Taming video diffusion models for panoramic 4d scene generation. arXiv preprint arXiv:2504.21650, 2025.