RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

Chensheng Dai; Shengjun Zhang; Yifan Li; Yueqi Duan; Zhang Zhang; Zheng Zhu

arxiv: 2606.06309 · v1 · pith:FK2QSPZMnew · submitted 2026-06-04 · 💻 cs.CV

Pith reviewed 2026-06-28 02:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video generation, diffusion transformers, denoising acceleration, keyframe scheduling, training-free method, latent trajectory projection, asynchronous flow

The pith

Video diffusion models can be accelerated by fully denoising only a sparse set of keyframes and skipping steps on the remaining frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the standard uniform dense denoising across every frame and timestep is redundant for natural video because adjacent frames share corresponding content and motion. It proposes identifying a sparse set of pivotal keyframes that dominate semantic evolution for complete step-by-step denoising, while non-keyframes skip intermediate steps to cut computation. A latent trajectory projection module is introduced so that keyframes can still interact with a full temporally consistent sequence representation despite the skips. Experiments on DiT-based video models show the approach delivers higher inference speed and improved visual quality over prior acceleration baselines. The framework operates without any model retraining or fine-tuning.

Core claim

RhymeFlow decouples the denoising trajectories of different frames. It first selects a sparse set of pivotal keyframes that capture critical semantic transitions and subjects only those to dense denoising across all timesteps. Non-keyframes progressively skip denoising steps. The latent trajectory projection module then enables the keyframes to interact with a complete and temporally consistent sequence representation, preventing visual degradation from the broken coherence caused by skipped states on other frames.
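
The scheduling idea can be sketched as follows. This is a minimal illustration under an assumed rule that the source does not specify: every `full_every`-th step and the final step process all frames, and all other steps process keyframes only. The function names, the `full_every` parameter, and the example sizes are hypothetical, not the authors' scheduler.

```python
def build_schedule(num_frames, num_steps, keyframes, full_every=4):
    """Return, for each denoising step, the list of frame indices to denoise.

    Assumption (not from the paper): every `full_every`-th step and the final
    step are "full" steps over all frames; the rest are "skip" steps that
    touch keyframes only.
    """
    key = sorted(set(keyframes))
    if any(k < 0 or k >= num_frames for k in key):
        raise ValueError("keyframe index out of range")
    schedule = []
    for step in range(num_steps):
        is_full = step % full_every == 0 or step == num_steps - 1
        schedule.append(list(range(num_frames)) if is_full else list(key))
    return schedule


def frame_step_fraction(schedule, num_frames):
    """Fraction of (frame, step) pairs computed relative to dense denoising."""
    computed = sum(len(active) for active in schedule)
    return computed / (num_frames * len(schedule))


schedule = build_schedule(num_frames=21, num_steps=8, keyframes=[0, 5, 10, 15, 20])
fraction = frame_step_fraction(schedule, num_frames=21)
```

With these toy settings three of the eight steps are full and five are keyframe-only, so a little over half of the dense frame-step budget is spent. The real saving depends on the skipping rule the paper actually uses.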

What carries the argument

Asynchronous denoising flow scheduling that separates keyframe and non-keyframe trajectories, combined with a latent trajectory projection module that restores temporal consistency.

If this is right

Inference latency drops because most frames avoid the full sequence of denoising steps.
Visual quality improves over rigid per-frame acceleration methods that still enforce dense trajectories everywhere.
The method works on existing DiT-based video models with no additional training required.
Overall computational cost falls while preserving the structural integrity anchored by the keyframes.
Temporal coherence is maintained through the projection step even though denoising is no longer synchronized across frames.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same keyframe-anchoring principle could be tested on non-video diffusion tasks that exhibit element-wise redundancy, such as long image sequences or 3D asset generation.
Combining this scheduling with existing sparse-attention or KV-cache techniques might compound the speed gains without extra engineering.
Adaptive selection of keyframes based on per-video motion statistics rather than a fixed sparsity ratio could further reduce average cost on simple scenes.
Real-time video synthesis pipelines might become practical once the per-frame denoising budget is reduced to a small fraction of the original timesteps.

Load-bearing premise

When a sparse set of keyframes with critical semantic transitions is fully denoised, the intermediate states of the remaining frames follow sufficiently predictable trajectories that skipping steps on them does not harm the final output.

What would settle it

Apply the method to a video sequence containing rapid unpredictable motion changes across nearly all frames and measure whether the output exhibits visible artifacts or coherence loss relative to a fully dense baseline run.
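
One way to run such a check is sketched below, under stated assumptions: decoded frames from the dense baseline and from the accelerated run are given as flat lists of pixel values in [0, 1], and a frame counts as degraded when its PSNR against the baseline falls under a threshold. The function names, the 30 dB threshold, and the toy data are illustrative and do not come from the paper.

```python
import math


def frame_psnr(reference, candidate, peak=1.0):
    """PSNR in dB between two equally sized frames given as flat lists."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def degraded_frames(dense_video, fast_video, threshold_db=30.0):
    """Indices of frames whose PSNR against the dense baseline is too low."""
    return [
        i
        for i, (ref, cand) in enumerate(zip(dense_video, fast_video))
        if frame_psnr(ref, cand) < threshold_db
    ]


# Toy data: frame 0 matches the baseline, frame 1 deviates strongly.
dense = [[0.0, 0.5, 1.0, 0.25], [0.1, 0.2, 0.3, 0.4]]
fast = [[0.0, 0.5, 1.0, 0.25], [0.3, 0.0, 0.5, 0.2]]
bad = degraded_frames(dense, fast)
```

PSNR against the dense output only measures fidelity to the baseline. A temporal-consistency metric would be needed in addition to detect coherence loss between neighbouring frames.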

Original abstract

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that, because adjacent frames share corresponding content and motion, when keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since skipped intermediate states of non-keyframes break the temporal coherence in keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RhymeFlow decouples denoising so keyframes get full steps while others skip, using a projection module for coherence, but the predictability assumption looks vulnerable on complex motion.

RhymeFlow's core move is to identify sparse keyframes that receive dense denoising across all timesteps while non-keyframes skip steps, then apply a latent trajectory projection module so the keyframes can still interact with a consistent full sequence.

The new element is this specific combination of keyframe anchoring for trajectory decoupling plus the projection step to handle the coherence breaks that skipping creates. It differs from the sparse attention and KV-caching baselines mentioned, which still run every frame through the complete schedule.

The paper correctly flags a real deployment pain point: quadratic 3D attention in DiT video models makes inference slow, and a training-free fix would be immediately usable. The observation that adjacent frames share content and motion is plausible for many natural videos.

The soft spot is the load-bearing assumption that once keyframes are fixed, the intermediate states of other frames stay predictable enough to skip safely. The stress-test note is right that this breaks down when motions are rapid or semantic changes are frequent; the projection module is meant to compensate, but the abstract gives no detail on whether it adds artifacts, extra compute, or fails in those cases. Experiments are claimed to show gains in speed and quality, yet the reader's low-confidence verdict on soundness is fair given the lack of visible setup or numbers.

This is for groups working on practical inference optimization for video diffusion. A reader already thinking about scheduling tricks could extract the scheduling idea and test it themselves.

The work shows honest engagement with the acceleration literature and a concrete algorithmic proposal. It deserves peer review so the experiments and edge cases can be checked properly.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce RhymeFlow, a training-free acceleration method for DiT-based video generation. It decouples denoising trajectories by identifying a sparse set of pivotal keyframes that receive dense step-by-step denoising while non-keyframes skip steps to reduce cost; a latent trajectory projection module is added to restore temporal coherence when skipped states would otherwise degrade quality. The central empirical claim is that this yields higher inference speed and better visual quality than existing baselines on current DiT video models.

Significance. If the empirical results and the projection module's effectiveness hold, the work offers a practical, training-free route to exploit temporal redundancy in natural video denoising, which could meaningfully lower inference costs for DiT video generators. The training-free design and focus on asynchronous flow scheduling are clear strengths; reproducible code or parameter-free derivations are not mentioned.

major comments (2)

[Introduction] Introduction (abstract and opening paragraphs): The load-bearing assumption that 'when keyframes with critical semantic transitions are anchored, the intermediate states of others often follow more predictable trajectories' is stated without quantitative support or testing on rapid/non-smooth motion cases; this directly justifies the skipping mechanism yet remains the point most exposed to failure, as complex motions could break coherence before the projection module acts.
[Method] Method (latent trajectory projection module description): The module is introduced to let keyframes interact with a 'complete and temporally consistent sequence representation,' but no equations, pseudocode, or complexity analysis are provided to show its overhead relative to the claimed savings from skipped steps or to demonstrate it avoids new artifacts; this is central to the claim that skipping does not degrade quality.

minor comments (2)

[Method] The keyframe identification criteria are listed as a free parameter; an explicit statement of the default heuristic or sensitivity analysis would improve reproducibility.
[Experiments] Figure captions and experimental tables should include exact model variants, number of frames, and hardware used so that speed/quality deltas can be directly compared to the cited baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned changes to the manuscript.

Referee: [Introduction] Introduction (abstract and opening paragraphs): The load-bearing assumption that 'when keyframes with critical semantic transitions are anchored, the intermediate states of others often follow more predictable trajectories' is stated without quantitative support or testing on rapid/non-smooth motion cases; this directly justifies the skipping mechanism yet remains the point most exposed to failure, as complex motions could break coherence before the projection module acts.

Authors: We agree that quantitative validation would strengthen the justification. The assumption stems from our observations on natural video data, but we will add a dedicated analysis (new figure and metrics such as average latent trajectory deviation) in the revised introduction and experiments section. This will include tests on rapid and non-smooth motion sequences to demonstrate the robustness of the skipping mechanism before the projection module is applied.
revision: partial

Referee: [Method] Method (latent trajectory projection module description): The module is introduced to let keyframes interact with a 'complete and temporally consistent sequence representation,' but no equations, pseudocode, or complexity analysis are provided to show its overhead relative to the claimed savings from skipped steps or to demonstrate it avoids new artifacts; this is central to the claim that skipping does not degrade quality.

Authors: We will expand the method section with the explicit equations governing the projection operation, pseudocode for the full asynchronous scheduling algorithm, and a complexity breakdown (showing the module's overhead is O(1) per keyframe interaction and negligible relative to skipped steps). We will also add ablation results confirming it restores coherence without introducing new artifacts. These elements will be incorporated in the revision.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; algorithmic scheduling with empirical validation

The paper introduces RhymeFlow as a training-free scheduling framework that identifies sparse keyframes for dense denoising and allows non-keyframes to skip steps, justified by an empirical observation on frame predictability in natural video. No equations, fitted parameters, or derivations are presented that reduce to inputs by construction. No self-citations are load-bearing for the core method. The central claims rest on experimental comparisons rather than any self-referential or fitted-input logic, rendering the contribution self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The method rests on a domain assumption about frame predictability in natural video and introduces a new projection module whose effectiveness lacks independent evidence in the abstract; keyframe selection criteria appear as an unspecified choice.

free parameters (1)

Keyframe identification criteria
The method requires selecting a sparse set of pivotal keyframes, but the abstract gives no explicit rule, threshold, or algorithm for this choice.
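
An appendix fragment captured elsewhere on this page mentions frame-to-frame latent similarity (e.g., cosine similarity) with clustering or thresholding as the selection mechanism. A minimal sketch of the thresholding variant follows. The greedy rule, the 0.9 threshold, the choice to always keep the first and last frames, and all names are assumptions for illustration, not the paper's procedure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened latent vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def select_keyframes(latents, threshold=0.9):
    """Greedy thresholding over per-frame latents.

    A frame becomes a keyframe when its similarity to the most recent
    keyframe drops below `threshold`. The first and last frames are always
    kept so that every non-keyframe lies between two anchors.
    """
    if not latents:
        return []
    keyframes = [0]
    for i in range(1, len(latents)):
        if cosine_similarity(latents[keyframes[-1]], latents[i]) < threshold:
            keyframes.append(i)
    last = len(latents) - 1
    if keyframes[-1] != last:
        keyframes.append(last)
    return keyframes


# Toy latents: frames 0-1 are similar, frame 2 changes direction sharply.
latents = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [0.1, 0.98]]
keys = select_keyframes(latents)
```

A sensitivity sweep over `threshold` would be one way to produce the analysis the referee asks for in the minor comments.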

axioms (1)

Domain assumption: Adjacent frames have corresponding contents and motions so that non-keyframe states follow predictable trajectories once keyframes are anchored.
This observation is invoked to justify skipping denoising steps for non-keyframes.

invented entities (1)

Latent trajectory projection module (no independent evidence)
purpose: Restores temporal coherence by allowing keyframes to interact with a complete sequence representation when non-keyframes skip steps.
New component introduced to counteract degradation from asynchronous skipping.
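
The source gives no equations for this module. As a purely illustrative sketch, assume a rectified-flow style model in which each latent moves along its last predicted velocity, so that a frame whose state is stale can be brought to the current timestep with an Euler update x_new = x_old + (t_now - t_old) * v. The function names, the cached velocities, and the timestep convention (t = 1 for noise, t = 0 for data) are assumptions, not the authors' design.

```python
def project_latent(latent, cached_velocity, t_old, t_now):
    """Euler extrapolation of one frame's latent along a cached velocity."""
    if len(latent) != len(cached_velocity):
        raise ValueError("latent and velocity must have the same length")
    dt = t_now - t_old
    return [x + dt * v for x, v in zip(latent, cached_velocity)]


def align_sequence(latents, velocities, frame_times, t_now):
    """Bring every frame to the shared timestep t_now.

    Frames already at t_now (for example keyframes) are returned unchanged
    because dt is zero; stale frames are projected forward.
    """
    return [
        project_latent(x, v, t_old, t_now)
        for x, v, t_old in zip(latents, velocities, frame_times)
    ]


# Toy case: both frames share clean data [2, 4] and zero noise, so v = -x0.
# Frame 0 is stale at t = 0.8, frame 1 is already at t = 0.5.
latents = [[0.4, 0.8], [1.0, 2.0]]
velocities = [[-2.0, -4.0], [-2.0, -4.0]]
aligned = align_sequence(latents, velocities, frame_times=[0.8, 0.5], t_now=0.5)
```

In this toy case the projection is exact because the trajectory is a straight line. On real latents the cached velocity is only an approximation, which is where artifacts could enter.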

pith-pipeline@v0.9.1-grok · 5814 in / 1250 out tokens · 40665 ms · 2026-06-28T02:06:34.739076+00:00 · methodology

[50]

Keyframe Identification: The computation of frame-to-frame latent similarity (e.g., cosine similarity) and the selection algorithm (clustering or thresholding) consume GPU cycles. 21
[51]

• Hardware Efficiency & Memory Access Constraints:The reduction in FLOPs does not translate linearly to latency reduction due to decreased GPU utilization during the skip steps

Latent Trajectories Projection: Generating intermediate states (xt−1) for skipped frames via flow-based latent projection requires additional vector operations, which, while lightweight, are not negligible. • Hardware Efficiency & Memory Access Constraints:The reduction in FLOPs does not translate linearly to latency reduction due to decreased GPU utiliza...
Reduced Parallelism: During skip steps, the model processes only $M = 5$ keyframes instead of the full $F = 21$ frames. This reduces the attention sequence length from $S_{\mathrm{full}} = 75{,}600$ to $S_{\mathrm{skip}} = 18{,}000$.
GPU Occupancy Drop: On high-performance GPUs, such a significant reduction in sequence length (∼76% decrease) lowers the kernel occupancy. The workload shifts from being compute-bound to memory-bound, meaning the GPU cores spend more time waiting for data transfer than performing calculations. Consequently, the effective TFLOPs/s achieved during skip step...
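The figures quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes every latent frame contributes the same number of tokens (so the per-frame count is the full sequence length divided by the frame count) and that self-attention cost grows with the square of the sequence length. The function name `skip_step_summary` and the returned field names are hypothetical and do not come from the paper.

```python
def skip_step_summary(num_frames: int, num_keyframes: int, s_full: int) -> dict:
    """Relate full-sequence and keyframe-only attention lengths.

    Assumption (not stated in the source): tokens are split evenly
    across frames, so s_full must be divisible by num_frames.
    """
    if s_full % num_frames != 0:
        raise ValueError("s_full is expected to be a multiple of num_frames")

    tokens_per_frame = s_full // num_frames
    s_skip = num_keyframes * tokens_per_frame
    length_ratio = s_skip / s_full

    return {
        "tokens_per_frame": tokens_per_frame,
        "s_skip": s_skip,
        # Fraction by which the sequence gets shorter during a skip step.
        "length_reduction": 1.0 - length_ratio,
        # Assumed quadratic scaling of self-attention with sequence length.
        "attention_cost_ratio": length_ratio ** 2,
        # Per-token layers (e.g. MLPs) scale linearly with sequence length.
        "linear_cost_ratio": length_ratio,
    }


if __name__ == "__main__":
    # Values quoted in the text: F = 21 frames, M = 5 keyframes, S_full = 75,600.
    summary = skip_step_summary(num_frames=21, num_keyframes=5, s_full=75_600)
    for name, value in summary.items():
        print(f"{name}: {value}")
```

With the quoted values this gives 3,600 tokens per frame, a skip-step length of 18,000, and a length reduction of about 0.762, consistent with the ∼76% figure. Under the quadratic assumption the attention FLOPs in a skip step fall to roughly 5.7% of the full cost, which illustrates why the remaining work may be too small to keep a large GPU fully occupied.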
