TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
Pith reviewed 2026-05-15 04:56 UTC · model grok-4.3
The pith
TeDiO improves temporal coherence in video diffusion by smoothing irregular diagonals in self-attention maps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Incoherent videos exhibit irregular, fragmented temporal diagonals in intermediate self-attention maps, while coherent motion shows smooth band-diagonal patterns. TeDiO reinforces temporal consistency by estimating diagonal smoothness, identifying unstable regions, and performing lightweight latent updates to promote coherent frame-to-frame dynamics.
What carries the argument
TeDiO, a training-free optimization that regularizes temporal diagonals in self-attention maps through targeted latent updates.
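To make the mechanism concrete, the sketch below shows one way a diagonal-smoothness score could be computed on a frame-to-frame self-attention map. The band-mask heuristic, the mean-minus-std score, and the function name are illustrative assumptions, not TeDiO's published estimator.

```python
import numpy as np

def temporal_diagonal_smoothness(attn, band=1):
    """Score how smoothly attention mass follows the temporal diagonal.

    attn : (T, T) array of frame-to-frame attention weights.
    band : half-width of the diagonal band treated as coherent.
    Higher = more band-diagonal, less fragmented. Illustrative heuristic
    only; not the paper's exact formulation.
    """
    T = attn.shape[0]
    offsets = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    band_mask = offsets <= band
    # Per-frame fraction of attention mass that stays inside the band.
    in_band = (attn * band_mask).sum(axis=1) / (attn.sum(axis=1) + 1e-8)
    # Smooth diagonals keep this fraction high and stable across frames;
    # fragmented diagonals make it low and erratic.
    return float(in_band.mean() - in_band.std())

# Toy check: a banded map scores higher than a scrambled one.
T = 16
banded = np.eye(T) + 0.4 * (np.eye(T, k=1) + np.eye(T, k=-1))
scrambled = np.random.default_rng(0).random((T, T))
assert temporal_diagonal_smoothness(banded) > temporal_diagonal_smoothness(scrambled)
```

On this reading, the 'lightweight latent updates' would follow the gradient of such a score with respect to the latent at selected denoising steps, which is what keeps the method training-free.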
If this is right
- Markedly smoother motion in generated videos across multiple diffusion models.
- Preservation of per-frame visual quality.
- Applicable as a plug-and-play addition to existing video generation pipelines.
- Efficient inference-time improvement without weight modifications.
Where Pith is reading between the lines
- Attention map diagnostics could be used to detect other types of generation failures in diffusion models.
- The approach might extend to improving coherence in other sequential generation tasks such as audio synthesis.
- Future work could explore whether similar diagonal regularization applies to cross-attention layers.
- Testing on a wider range of video lengths and motion complexities would clarify the method's limits.
Load-bearing premise
That the primary cause of temporal incoherence is irregular temporal diagonals in self-attention and that lightweight latent updates can reliably smooth them without creating new artifacts.
What would settle it
Generate videos with deliberately introduced temporal artifacts and observe whether TeDiO fails to smooth the attention diagonals or reduces visual quality in those cases.
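A minimal sketch of how that stress test could be set up, assuming a simple per-frame brightness-jitter model of 'deliberately introduced temporal artifacts' (the perturbation choice and the function name are illustrative, not from the paper):

```python
import numpy as np

def inject_flicker(frames, strength=0.15, seed=0):
    """Deliberately break temporal coherence via per-frame brightness jitter.

    frames : (T, H, W, C) float array in [0, 1].
    Returns a copy whose consecutive frames disagree in global intensity,
    a crude stand-in for flicker/drift artifacts. One would then check
    whether TeDiO re-smooths the attention diagonals on such clips and
    whether per-frame quality survives the correction.
    """
    rng = np.random.default_rng(seed)
    gains = 1.0 + strength * rng.standard_normal(frames.shape[0])
    return np.clip(frames * gains[:, None, None, None], 0.0, 1.0)
```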
Original abstract
Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper observes that temporal incoherence in video diffusion transformers manifests as irregular, fragmented temporal diagonals in intermediate self-attention maps, while coherent motion produces smooth band-diagonal patterns. It introduces TeDiO, a training-free inference-time method that estimates diagonal smoothness, identifies unstable regions, and applies lightweight latent updates to promote coherent frame-to-frame dynamics without modifying weights or using external supervision. The method is evaluated across models such as Wan2.1 and CogVideoX, claiming markedly smoother motion while preserving per-frame visual quality.
Significance. If the central empirical claim holds after proper validation, TeDiO would offer a lightweight, plug-and-play inference-time intervention that improves temporal coherence in existing pre-trained video diffusion models. This could be practically significant for video generation pipelines where retraining is costly, provided the diagonal-regularization mechanism is shown to be causal rather than incidental.
major comments (3)
- [Experiments / Ablation studies] The central claim that irregular temporal diagonals are the primary cause of flickering (and that targeting them via latent updates reliably restores coherence) requires explicit ablation isolating the diagonal smoothness term. Without comparisons to generic latent optimization or alternative attention regularizers, it remains unclear whether the specific diagonal focus is load-bearing or whether any consistency-promoting update would suffice.
- [Results / Quantitative evaluation] Quantitative support for 'markedly smoother motion' is not detailed in the provided description. The results section must report specific metrics (e.g., temporal consistency scores, optical-flow variance, user-study percentages) with statistical significance and direct comparisons to baselines and prior training-free methods; absence of these numbers leaves the improvement claim unverified. One simple temporal-consistency proxy is sketched after this list.
- [Method / TeDiO formulation] The method description must supply the precise formulation of the diagonal-smoothness estimator and the latent-update objective (including any hyperparameters). Without these equations, it is impossible to assess whether the procedure is truly parameter-free or model-agnostic as asserted.
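For the metrics requested above, here is a minimal sketch of one flicker-style temporal-consistency proxy (mean and variance of consecutive-frame differences), assuming ordinary RGB frames as input; it is a common stand-in chosen for illustration, not necessarily the metric the paper reports.

```python
import numpy as np

def temporal_consistency(frames):
    """Crude smoothness proxy for a generated clip.

    frames : (T, H, W, C) float array in [0, 1].
    Returns (mean, variance) of the per-step mean absolute frame change;
    lower values indicate smoother motion. Flow-warped variants
    (e.g., RAFT-based warping error) would be a stronger choice.
    """
    step = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))  # (T-1,)
    return float(step.mean()), float(step.var())
```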
minor comments (2)
- [Method] Clarify the exact definition of 'temporal diagonal' (e.g., which attention heads/layers and how the band width is chosen) to avoid ambiguity in replication.
- [Discussion] Add a limitations paragraph discussing potential failure modes, such as degradation on highly dynamic scenes or interaction with classifier-free guidance scales.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where revisions are needed to strengthen the manuscript, we will incorporate them in the next version.
Point-by-point responses
Referee: [Experiments / Ablation studies] The central claim that irregular temporal diagonals are the primary cause of flickering (and that targeting them via latent updates reliably restores coherence) requires explicit ablation isolating the diagonal smoothness term. Without comparisons to generic latent optimization or alternative attention regularizers, it remains unclear whether the specific diagonal focus is load-bearing or whether any consistency-promoting update would suffice.
Authors: We agree that explicit ablations are necessary to isolate the contribution of the diagonal smoothness term. In the revised manuscript we will add new experiments comparing TeDiO against (i) generic latent-space optimization without the diagonal term and (ii) alternative attention regularizers (e.g., total-variation on attention maps). These results will demonstrate that the temporal-diagonal focus is load-bearing for the observed coherence gains. revision: yes
Referee: [Results / Quantitative evaluation] Quantitative support for 'markedly smoother motion' is not detailed in the provided description. The results section must report specific metrics (e.g., temporal consistency scores, optical-flow variance, user-study percentages) with statistical significance and direct comparisons to baselines and prior training-free methods; absence of these numbers leaves the improvement claim unverified.
Authors: We will expand the results section with a new table reporting concrete values: temporal consistency scores (CLIP-based and feature-based), optical-flow variance, and user-study preference percentages (with 95% confidence intervals and p-values). Direct comparisons to prior training-free baselines will be included. The full manuscript already contains some of these metrics; the revision will present them more prominently and with statistical tests. revision: yes
Referee: [Method / TeDiO formulation] The method description must supply the precise formulation of the diagonal-smoothness estimator and the latent-update objective (including any hyperparameters). Without these equations, it is impossible to assess whether the procedure is truly parameter-free or model-agnostic as asserted.
Authors: The original submission contains the estimator (standard deviation of attention values along temporal diagonals) and the update objective (regularized latent optimization), but we acknowledge they were not presented with sufficient formality. In the revision we will add the explicit equations, define all symbols, and list the (few) hyperparameters with their default values, thereby confirming the method remains training-free and model-agnostic. revision: partial
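Taking the rebuttal's description at face value (standard deviation along temporal diagonals as the estimator, regularized latent optimization as the update), one plausible formalization reads as follows; the symbols, the band half-width b, and the step size eta are assumptions for illustration, not the paper's actual equations.

```latex
% Illustrative formalization only; not the paper's published equations.
% A^{(l)}(z) \in \mathbb{R}^{T \times T}: temporal self-attention map at layer l,
% computed from the current latent z; b is the band half-width.
\mathcal{S}\big(A^{(l)}\big) \;=\; \sum_{|k| \le b} \operatorname{std}_{t}\!\Big(A^{(l)}_{t,\,t+k}\Big)
\qquad \text{(diagonal irregularity)}

z \;\leftarrow\; z \;-\; \eta \, \nabla_{z} \sum_{l \in \mathcal{L}} \mathcal{S}\big(A^{(l)}(z)\big)
\qquad \text{(lightweight latent update)}
```

Under this reading, 'identifying unstable regions' would amount to restricting the sums to diagonals or frame indices whose standard deviation exceeds a threshold.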
Circularity Check
No circularity; empirical observation followed by plug-and-play regularization
Full rationale
The paper's central claim rests on an observed correlation between irregular temporal diagonals in self-attention maps and video incoherence, followed by a training-free latent-update method that promotes diagonal smoothness. No equations, parameter fitting, self-citations, or uniqueness theorems are provided in the supplied text that would reduce the method or its performance claims to the inputs by construction. The derivation chain is therefore self-contained as an empirical intervention rather than a self-referential loop.