Efficient Video Diffusion Models: Advancements and Challenges
Pith reviewed 2026-05-10 08:25 UTC · model grok-4.3
The pith
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
To the best of our knowledge, our work is the first comprehensive survey on efficient video diffusion models, offering researchers and engineers a structured overview of the field and its emerging research directions.
Load-bearing premise
That the proposed four-class categorization (step distillation, efficient attention, model compression, cache/trajectory optimization) covers all relevant methods comprehensively and without bias, leaving no significant omissions or overlaps between classes.
Original abstract
Video diffusion models have rapidly become the dominant paradigm for high-fidelity generative video synthesis, but their practical deployment remains constrained by severe inference costs. Compared with image generation, video synthesis compounds computation across spatial-temporal token growth and iterative denoising, making attention and memory traffic major bottlenecks in real-world settings. This survey provides a systematic and deployment-oriented review of efficient video diffusion models. We propose a unified categorization that organizes existing methods into four classes of main paradigms, including step distillation, efficient attention, model compression, and cache/trajectory optimization. Building on this categorization, we respectively analyze algorithmic trends of these four paradigms and examine how different design choices target two core objectives: reducing the number of function evaluations and minimizing per-step overhead. Finally, we discuss open challenges and future directions, including quality preservation under composite acceleration, hardware-software co-design, robust real-time long-horizon generation, and open infrastructure for standardized evaluation. To the best of our knowledge, our work is the first comprehensive survey on efficient video diffusion models, offering researchers and engineers a structured overview of the field and its emerging research directions.
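The abstract's cost framing can be made concrete with a toy model (not from the survey; all shapes, patch sizes, step counts, and sparsity ratios below are hypothetical): total inference cost is roughly the number of function evaluations (NFE) times the per-step attention cost, so step distillation shrinks the first factor while efficient/sparse attention shrinks the second.

```python
# Illustrative cost sketch, not the survey's method.
# All shape and step numbers are assumed for the example.

def seq_len(frames, height, width, patch_t=1, patch_hw=2):
    """Token count for a video DiT with simple 3D patchification."""
    return (frames // patch_t) * (height // patch_hw) * (width // patch_hw)

def attn_cost(tokens):
    """Full self-attention scales quadratically in the token count."""
    return tokens ** 2

def total_cost(steps, tokens, sparsity=1.0):
    """Total cost ~ NFE x per-step attention cost; `sparsity` models
    the fraction of attention actually computed by a sparse method."""
    return steps * attn_cost(tokens) * sparsity

image_tokens = seq_len(frames=1, height=64, width=64)    # one latent image
video_tokens = seq_len(frames=16, height=64, width=64)   # short latent clip

baseline = total_cost(steps=50, tokens=video_tokens)
distilled = total_cost(steps=4, tokens=video_tokens)               # fewer NFE
sparse = total_cost(steps=50, tokens=video_tokens, sparsity=0.1)   # cheaper steps

print(f"video/image token growth:  {video_tokens / image_tokens:.0f}x")
print(f"step-distillation speedup: {baseline / distilled:.1f}x")
print(f"sparse-attention speedup:  {baseline / sparse:.1f}x")
```

The two knobs multiply, which is why the survey's open challenges stress quality preservation under *composite* acceleration: combining a 12.5x step reduction with a 10x per-step saving compounds, but so do their individual quality losses.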
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
No circularity: survey paper with no derivations or self-referential claims
Full rationale
This is a survey paper that organizes existing literature on efficient video diffusion models into four proposed categories (step distillation, efficient attention, model compression, cache/trajectory optimization). It contains no original equations, derivations, fitted parameters, predictions, or mathematical results. The central claim is that the work is the first comprehensive survey, which is a statement of scope and novelty rather than a derived quantity. No self-citations are load-bearing for any result, and the categorization is an organizational framework applied to external methods, not a reduction to the paper's own inputs. The paper is self-contained as a review and scores 0 on circularity.
Axiom & Free-Parameter Ledger
Empty: the survey introduces no original equations, fitted parameters, or derived quantities.
Forward citations
Cited by 1 Pith paper
- Exploring Data-Free LoRA Transferability for Video Diffusion Models: CASA uses spectral density to arbitrate between preserving the target model's manifold and restoring LoRA alignment, mitigating style degradation and structural collapse in distilled video diffusion models.
Reference graph
Works this paper leans on
- [4] Ganesh Bikshandi, Tri Dao, Pradeep Ramani, Jay Shah, Vijay Thakkar, and Ying Zhang. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. In Advances in Neural Information Processing Systems 37. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 68658–68685. https://doi.org/10.52202/079017-2193
- [16] Siyan Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Xuyan Chi, Jian Cong, Qinpeng Cui, Qide Dong, Junliang Fan, et al.
- [17] Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model. arXiv:2512.13507 https://arxiv.org/abs/2512.13507
- [29] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. OpenReview.net. https://openreview.net...
- [43] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. 2025. Mean flows for one-step generative modeling. arXiv:2505.13447 https://arxiv.org/abs/2505.13447
- [49] Junxian Guo, Haotian Tang, Shang Yang, Zhekai Zhang, Zhijian Liu, and Song Han. 2024. Block Sparse Attention. https://github.com/mit-han-lab/Block-Sparse-Attention
- [50] Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. 2026. LTX-2: Efficient Joint Audio-Visual Foundation Model. arXiv:2601.03233 https://arxiv.org/abs/2601.03233
- [51] Hao-AI-Lab. 2025. FastVideo. https://github.com/hao-ai-lab/FastVideo/tree/main
- [57] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. 2024. VBench: Comprehensive Benchmark Suite for Video Generative Models. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR...
- [58] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. 2026. VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models. IEEE Transactions on Pattern Analysis and Machine I...
- [63] Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S. Ryoo, and Tian Xie. 2024. Adaptive Caching for Faster Video Generation with Diffusion Transformers. arXiv:2411.02397 https://arxiv.org/abs/2411.02397
- [64] Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 4396–4405. https://doi.org/10.1109/cvpr.2019.00453
- [66] Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park, and Youngjae Yu. 2025. Vip: Iterative online preference distillation for efficient video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17235–17245.
Vol. 1, No. 1, Article. Publication date: April 2026. Shitong Shao, James Kwok, Pengfei Wan, ...
- [67] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. 2024. HunyuanVideo: A Systematic Framework For Large Video Generative Models. arXiv:2412.03603 https://arxiv.org/abs/2412.03603
- [69] Black Forest Labs. 2024. FLUX. https://blackforestlabs.ai/
- [72] Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, and Song Han. 2025. Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation. arXiv:2506.19852 https://arxiv.org/abs/2506.19852
- [78] Akide Liu, Zeyu Zhang, Zhexin Li, Xuehai Bai, Yizeng Han, Jiasheng Tang, Yuanjie Xing, Jichao Wu, Mingyang Yang, Weihua Chen, Jiahao He, Yuanyu He, Fan Wang, Gholamreza Haffari, and Bohan Zhuang. 2025. FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion. arXiv:2506.04648 https://arxiv.org/abs/2506.04648
- [79] Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James T. Kwok, Sumi Helal, and Zeke Xie. 2026. Alignment of Diffusion Models: Fundamentals, Challenges, and Future. Comput. Surveys 58, 9 (March 2026), 1–37. https://doi.org/10.1145/3796982
- [88] Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, and Ziwei Liu
- [89] Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency. arXiv:2503.20785 https://arxiv.org/abs/2503.20785
- [90] Xingchao Liu, Chengyue Gong, and Qiang Liu. 2022. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv:2209.03003 https://arxiv.org/abs/2209.03003
- [91] Xinyan Liu, Huihong Shi, Yang Xu, and Zhongfeng Wang. 2026. TaQ-DiT: Time-aware Quantization for Diffusion Transformers. IEEE Transactions on Circuits and Systems for Video Technology (2026), 1–1. https://doi.org/10.1109/tcsvt.2026.3652275
- [92] Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, and Fei Wang. 2025. UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space. In Proceedings of the 33rd ACM International Conference on Multimedia. ACM, 7785–7794. https://doi.org/10.1145/3746027.3755117
- [94] Beijia Lu, Ziyi Chen, Jing Xiao, and Jun-Yan Zhu. 2025. Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers. ACM, 1–11. https://doi.org/10.1145/3757377.3763831
- [96] Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. 2025. Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation. arXiv:2512.04678 https://arxiv.org/abs/2512.04678
- [98] Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, and Ziwei Liu. 2025. Dual-Expert Consistency Model for Efficient and High-Quality Video Generation. arXiv:2506.03123 https://arxiv.org/abs/2506.03123
- [100] Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang...
- [101] Xin Ma, Yaohui Wang, Genyun Jia, Xinyuan Chen, Tien-Tsin Wong, and Cunjian Chen. 2026. Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence (2026), 1–16. https://doi.org/10.1109/tpami.2026.3664227
- [103] Erwann Millon. 2025. Krea Realtime 14B: Real-time Video Generation. https://github.com/krea-ai/realtime-video
- [104] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. 2024. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation. arXiv:2407.02371 https://arxiv.org/abs/2407.02371
- [105] Open-Sora-Plan. 2024. Mixkit. https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0
- [106] William Peebles and Saining Xie. 2023. Scalable Diffusion Models with Transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 4172–4182. https://doi.org/10.1109/iccv51070.2023.00387
- [112] Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. 2024. Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. In SIGGRAPH Asia 2024 Conference Papers. ACM, 1–11. https://doi.org/10.1145/3680528.3687625
- [113] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. 2024. Adversarial Diffusion Distillation. In Computer Vision – ECCV 2024. Springer Nature Switzerland, 87–103. https://doi.org/10.1007/978-3-031-73016-0_6
- [117] Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu
- [118] DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance. arXiv:2505.14708 https://arxiv.org/abs/2505.14708
- [119] Kuaishou. 2024. Kling 2.6. https://app.klingai.com/global/release-notes/c605hp1tzd?type=dialog
- [120] Gaurav Shrivastava and Abhinav Shrivastava. 2024. Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 7236–7245. https://doi.org/10.1109/cvpr52733.2024.00691
- [121] SkyTimelapse. 2021. SkyTimelapse. youtube.com/channel/UCtLemFmUPZYItte3PpG7f2Q/videos?reload=9
- [122] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. 2023. Consistency Models. In International Conference on Machine Learning. PMLR, 32211–32252.
- [123] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations. OpenReview.net. https://openreview.net/forum?id=PxTIG12RRHS
- [124] K. Soomro. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402 https://arxiv.org/abs/1212.0402
- [125] Stability.ai. 2024. Introducing Stable Diffusion 3.5. https://stability.ai/news/introducing-stable-diffusion-3-5
- [126] Wenzhang Sun, Qirui Hou, Donglin Di, Jiahui Yang, Yongjia Ma, and Jianxun Cui. 2025. UniCP: A Unified Caching and Pruning Framework for Efficient Video Generation. In Proceedings of the 7th ACM International Conference on Multimedia in Asia. ACM, 1–7. https://doi.org/10.1145/3743093.3770981