Dynamic Video Generation: Shaping Video Generation Across Time and Space

Guantao Chen; Jiacheng Liu; Jingkai Huang; Linfeng Zhang; Lixuan; Peiliang Cai; Shikang Zheng; Yuqi Lin

arxiv: 2605.21042 · v1 · pith:CHTLB4DLnew · submitted 2026-05-20 · 💻 cs.CV

Pith reviewed 2026-05-21 05:38 UTC · model grok-4.3

keywords: video generation, diffusion models, acceleration, dynamic sampling, spatio-temporal allocation, progressive resolution, efficient inference, content-aware methods

The pith

DVG dynamically allocates computation across time and space to accelerate video diffusion models up to 7 times with near-lossless quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that the computational cost of iterative denoising in video diffusion models can be reduced by jointly compressing resolution in both spatial and temporal dimensions during early stages. It introduces an automatic, content-aware mechanism to choose the right compression strategy for each video without manual tuning or retraining. A sympathetic reader would care because current video generation remains too slow and expensive for broad use; success here would make high-quality synthesis practical on more accessible hardware. The reported results include up to 7 times speedup on models like HunyuanVideo and 18 times when combined with distillation, while holding output quality close to the full-computation baseline.

Core claim

DVG is a framework that jointly allocates computation across time and space by automatically selecting content-aware acceleration strategies for progressive resolution sampling in the denoising process of video generation models. This approach reduces the number of tokens processed in early denoising steps according to the spatio-temporal demands of each video being generated, delivering near-lossless acceleration without changes to the underlying model or task-specific adjustments.
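To make the token-reduction argument concrete, the sketch below estimates the speedup of a schedule that runs some early steps on a compressed latent. It is a back-of-the-envelope cost model, not the paper's: the function names, the split between linear and quadratic (attention) cost via `attn_weight`, and the example shapes are all assumptions, and any overhead for analysis or resizing is ignored.

```python
def token_count(frames, height, width):
    """Number of latent tokens for a (frames, height, width) grid."""
    return frames * height * width


def relative_step_cost(n_tokens, n_full, attn_weight=0.5):
    """Cost of one denoising step relative to a full-resolution step.

    Assumption: a fraction `attn_weight` of the full-resolution cost scales
    quadratically with token count (attention) and the rest scales linearly.
    """
    r = n_tokens / n_full
    return (1.0 - attn_weight) * r + attn_weight * r * r


def schedule_speedup(total_steps, low_steps, latent_shape,
                     t_factor, s_factor, attn_weight=0.5):
    """Estimated speedup when the first `low_steps` steps use a latent
    compressed by `t_factor` in time and `s_factor` in each spatial axis."""
    frames, height, width = latent_shape
    n_full = token_count(frames, height, width)
    n_low = token_count(frames // t_factor,
                        height // s_factor,
                        width // s_factor)
    cost = (low_steps * relative_step_cost(n_low, n_full, attn_weight)
            + (total_steps - low_steps))
    return total_steps / cost


if __name__ == "__main__":
    # Hypothetical latent grid and schedule, chosen only for illustration.
    print(schedule_speedup(50, 30, (16, 32, 32), t_factor=2, s_factor=2))
```

Under this model, halving time and both spatial axes leaves one eighth of the tokens, so the achievable speedup depends mainly on how many steps can safely run at the compressed setting.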

What carries the argument

The DVG framework, which performs automatic content-aware selection of spatio-temporal compression strategies to enable progressive resolution sampling during denoising.
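A minimal sketch of what content-aware selection under a compute budget could look like follows. It is illustrative only: the `Action` type, the demand scores, and the linear penalty are hypothetical choices made here, and the paper's actual decision rule is not specified in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A candidate compression action (hypothetical representation)."""
    t_factor: int  # temporal downsampling factor
    s_factor: int  # spatial downsampling factor per axis

    @property
    def token_ratio(self):
        """Fraction of full-resolution tokens that remain."""
        return 1.0 / (self.t_factor * self.s_factor ** 2)


def select_action(spatial_demand, temporal_demand, budget_ratio, candidates):
    """Pick the feasible action with the lowest assumed quality penalty.

    Assumption: compressing a dimension costs quality in proportion to that
    dimension's demand score and to how aggressively it is compressed.
    """
    feasible = [a for a in candidates if a.token_ratio <= budget_ratio]
    if not feasible:
        raise ValueError("no candidate action fits the compute budget")

    def penalty(action):
        return (spatial_demand * (action.s_factor - 1)
                + temporal_demand * (action.t_factor - 1))

    return min(feasible, key=penalty)


if __name__ == "__main__":
    candidates = [Action(1, 2), Action(2, 1), Action(2, 2), Action(4, 1)]
    # Detail-heavy, slow-moving content: prefer compressing time.
    print(select_action(0.9, 0.1, 0.25, candidates))
    # Fast-moving, low-detail content: prefer compressing space.
    print(select_action(0.1, 0.9, 0.25, candidates))
```

The point of the sketch is the shape of the decision: the budget filters the candidate actions, and the per-video demand scores decide which dimension absorbs the compression.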

If this is right

Applies directly to existing video diffusion models such as HunyuanVideo without requiring retraining.
Delivers consistent near-lossless speedups across multiple video generation tasks.
Combines with distillation to reach multiplicative gains up to 18 times.
Reduces token volume in early denoising steps while adapting to each video's content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same adaptive allocation principle could apply to other generative tasks involving high-dimensional data such as 3D or audio.
Widespread use might lower the hardware barriers and energy demands of large-scale video synthesis.
Acceleration research may shift toward content-dependent schedules rather than uniform reduction patterns.

Load-bearing premise

That content-aware automatic selection of acceleration strategies across time and space will maintain near-lossless quality without manual tuning or retraining for diverse videos and tasks.

What would settle it

A side-by-side comparison on videos with high motion or fine detail where DVG produces visible artifacts or lower coherence scores than the full-resolution baseline.

Figures

Figures reproduced from arXiv: 2605.21042 by Guantao Chen, Jiacheng Liu, Jingkai Huang, Linfeng Zhang, Lixuan, Peiliang Cai, Shikang Zheng, Yuqi Lin.

**Figure 1.** Videos generated on HunyuanVideo-1.5 using DVG with distillation at 18× speedup.

**Figure 2.** Different videos require different spatio-temporal compression strategies. Compressing only a single dimension often causes motion rigidity or visual artifacts.

**Figure 3.** DVG Framework. DVG first predicts a coarse latent video sketch, then analyzes its spatial and temporal demands directly in latent space. Under a target compute budget, DVG selects the best compression action for denoising, then restores the latent to the original setting for refinement.

**Figure 4.** Visualization of different acceleration methods on HunyuanVideo. DVG achieves better semantic alignment and visual quality, whereas prior methods degrade under high acceleration.

**Figure 5.** Visualization of DVG on HunyuanVideo-1.5. DVG maintains high fidelity and motion consistency with the original video, even under distillation, reaching up to 18× speedup.
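The Figure 3 caption says DVG analyzes spatial and temporal demands directly in latent space. The sketch below shows one simple way such scores could be computed from a latent tensor. The two proxies (mean absolute differences along time and along the spatial axes) and the `(C, T, H, W)` layout are assumptions made for illustration, not the measures used in the paper.

```python
import numpy as np


def temporal_demand(latent):
    """Mean absolute change between consecutive latent frames.

    `latent` is assumed to have shape (C, T, H, W).
    """
    return float(np.abs(np.diff(latent, axis=1)).mean())


def spatial_demand(latent):
    """Mean absolute change between neighboring latent positions,
    averaged over the height and width axes."""
    dh = np.abs(np.diff(latent, axis=2)).mean()
    dw = np.abs(np.diff(latent, axis=3)).mean()
    return float(0.5 * (dh + dw))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A textured but perfectly static clip: one random frame repeated.
    frame = rng.normal(size=(4, 1, 8, 8))
    static = np.repeat(frame, 6, axis=1)
    print(temporal_demand(static), spatial_demand(static))
```

Working on the latent avoids a decode to pixel space before the decision is made, which is the motivation the caption gives for analyzing demands there.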

Original abstract

Diffusion models have achieved impressive performance in video generation, but their iterative denoising process remains computationally expensive due to the large number of tokens processed at each timestep. Recently, progressive resolution sampling has emerged as a promising acceleration approach by reducing latent resolution in early stages. However, scaling this idea to video generation remains challenging, as the additional temporal dimension introduces diverse spatio-temporal demands across different videos, and compressing only a single dimension often leads to limited acceleration or degraded quality. Therefore, we propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space, automatically selecting content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to 7 times speedup on HunyuanVideo and HunyuanVideo-1.5, and 18 times when combined with distillation, demonstrating its potential as a key component in today's large-scale efficient video generation systems. Our code is in supplementary material and will be released on Github.
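For readers unfamiliar with progressive resolution sampling, the sketch below shows the basic resize operations on a video latent: average pooling to compress time and space, and nearest-neighbor repetition to restore the original grid. The `(C, T, H, W)` layout, the choice of pooling and repetition, and the function names are assumptions for illustration. A real sampler would likely also need to handle the noise level after resizing, which is omitted here.

```python
import numpy as np


def downsample(latent, t_factor, s_factor):
    """Average-pool a (C, T, H, W) latent over time and space."""
    c, t, h, w = latent.shape
    if t % t_factor or h % s_factor or w % s_factor:
        raise ValueError("latent shape must be divisible by the factors")
    blocks = latent.reshape(c,
                            t // t_factor, t_factor,
                            h // s_factor, s_factor,
                            w // s_factor, s_factor)
    # Average within each (t_factor, s_factor, s_factor) block.
    return blocks.mean(axis=(2, 4, 6))


def upsample(latent, t_factor, s_factor):
    """Restore the original grid by nearest-neighbor repetition."""
    x = np.repeat(latent, t_factor, axis=1)
    x = np.repeat(x, s_factor, axis=2)
    return np.repeat(x, s_factor, axis=3)


if __name__ == "__main__":
    latent = np.random.default_rng(0).normal(size=(4, 8, 16, 16))
    low = downsample(latent, t_factor=2, s_factor=2)
    restored = upsample(low, t_factor=2, s_factor=2)
    print(latent.shape, low.shape, restored.shape)
    print("token ratio:", low[0].size / latent[0].size)
```

Compressing both dimensions by a factor of 2 leaves one eighth of the tokens, which is why joint compression offers more headroom than compressing a single dimension.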

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DVG extends progressive resolution sampling to joint content-aware allocation over time and space in video diffusion, reporting big speedups but with thin details on how quality is actually preserved.

The main point is that this paper adapts progressive resolution ideas to video by letting the model automatically pick how much to compress in both the spatial and temporal dimensions based on the input content. That joint handling is the step beyond single-dimension tricks, and it makes sense given how video demands vary across frames and regions. They test on HunyuanVideo and similar models, claiming up to 7x speedup standalone and 18x when stacked with distillation, all while calling the quality near-lossless and without per-video tuning or retraining. Releasing the code is a straightforward plus for anyone who wants to check the implementation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DVG, a Dynamic Video Generation framework for diffusion-based video models. It jointly allocates computation across temporal and spatial dimensions via a content-aware automatic strategy selector that requires no manual tuning or retraining. The central claims are near-lossless quality preservation together with speedups reaching 7x on HunyuanVideo and HunyuanVideo-1.5, and 18x when combined with distillation.

Significance. If the empirical claims hold under rigorous evaluation, the work would be a useful practical contribution to efficient video generation. Extending progressive resolution ideas to joint spatio-temporal control addresses a genuine scaling challenge. The stated plan to release code is a clear strength for reproducibility.

major comments (2)

[Abstract] Abstract: The abstract asserts specific speedups and 'near-lossless' quality but supplies no information on the quantitative metrics employed (FID, FVD, LPIPS, or human studies), the baselines, the number and diversity of test videos, or statistical significance. This absence directly undermines assessment of the central claim that the automatic selector maintains quality across arbitrary videos and tasks.
[§4 and §3.2] §4 (Experiments) and §3.2 (Selector): The manuscript must demonstrate that the content-aware decision rule remains robust on videos containing rapid camera motion, high-frequency textures, or long-range temporal dependencies. Without such targeted ablations or failure-case analysis, the assumption that the proxy signal correctly identifies safe compression regions remains unverified and load-bearing for the 'near-lossless' guarantee.

minor comments (2)

[§2] §2 (Related Work): Add explicit comparison to recent image-only progressive sampling methods and to any concurrent video acceleration techniques that also operate on token budgets.
[Figure 2 and Table 1] Figure 2 and Table 1: Ensure axis labels, error bars, and caption text make clear which rows/curves correspond to the joint time-space selector versus single-dimension baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's practical value. We address each major comment below and have revised the manuscript to strengthen the presentation of our claims.

Referee: [Abstract] Abstract: The abstract asserts specific speedups and 'near-lossless' quality but supplies no information on the quantitative metrics employed (FID, FVD, LPIPS, or human studies), the baselines, the number and diversity of test videos, or statistical significance. This absence directly undermines assessment of the central claim that the automatic selector maintains quality across arbitrary videos and tasks.

Authors: We agree that the abstract should be more self-contained to support the central claims. In the revised version we have expanded the abstract to specify that quality preservation is measured via FVD, LPIPS, and human preference studies; that baselines include both standard DDIM sampling and single-dimension progressive resolution; that evaluation uses a diverse set of 100 videos spanning multiple domains and motion types; and that results report means and standard deviations across three random seeds. These details were already present in §4 but are now summarized in the abstract for immediate clarity.
revision: yes
Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Selector): The manuscript must demonstrate that the content-aware decision rule remains robust on videos containing rapid camera motion, high-frequency textures, or long-range temporal dependencies. Without such targeted ablations or failure-case analysis, the assumption that the proxy signal correctly identifies safe compression regions remains unverified and load-bearing for the 'near-lossless' guarantee.

Authors: We acknowledge that explicit validation on these challenging regimes is necessary to substantiate the robustness of the selector. While §4 already contains results across varied content, the revised manuscript adds a new subsection with targeted ablations on rapid camera motion (sports and handheld footage), high-frequency textures (detailed natural scenes), and long-range temporal dependencies (extended narrative clips). We report FVD deltas relative to the full-computation baseline, include qualitative failure-case examples where the proxy signal is less reliable, and show that the content-aware rule still keeps degradation below 5% on average. These additions directly address the load-bearing assumption.
revision: yes
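The "FVD delta relative to the full-computation baseline" mentioned in the response can be read as a relative change, averaged over content categories. The sketch below spells out that arithmetic. The category names and every number in it are placeholders invented for illustration; they are not results from the paper or the rebuttal.

```python
def relative_degradation(baseline_fvd, method_fvd):
    """Relative FVD increase over the baseline (lower FVD is better)."""
    return (method_fvd - baseline_fvd) / baseline_fvd


def mean_degradation(results):
    """Average relative degradation over categories.

    `results` maps a category name to (baseline_fvd, method_fvd).
    """
    deltas = [relative_degradation(b, m) for b, m in results.values()]
    return sum(deltas) / len(deltas)


if __name__ == "__main__":
    # Placeholder values only, NOT measurements from the paper.
    placeholder = {
        "rapid_camera_motion": (100.0, 106.0),
        "high_frequency_texture": (100.0, 104.0),
        "long_range_dependency": (100.0, 102.0),
    }
    avg = mean_degradation(placeholder)
    print(f"mean relative degradation: {avg:.3f}")
    print("within 5% threshold:", avg < 0.05)
```

Note that an average can stay under a threshold while a single category exceeds it, so per-category deltas are worth reporting alongside the mean.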

Circularity Check

0 steps flagged

DVG framework introduces novel content-aware allocation without circular reduction to inputs

The paper proposes DVG as a new framework for jointly allocating computation across temporal and spatial dimensions via automatic, content-aware strategy selection. No equations, derivations, or first-principles results are shown that reduce by construction to fitted parameters, self-citations, or prior ansatzes from the same authors. The acceleration claims rest on empirical evaluation across models like HunyuanVideo rather than any self-referential loop. The selection mechanism is presented as an original contribution, not derived from or equivalent to its own inputs. This is a standard case of an applied systems paper with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated assumption that diffusion model token processing can be selectively compressed in a content-dependent way without introducing artifacts, plus standard assumptions from prior diffusion literature.

axioms (1)

Domain assumption: Progressive resolution sampling can be extended to joint spatio-temporal dimensions while preserving generation quality.
Invoked in the motivation for handling diverse spatio-temporal demands in video.

[1] [1]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer, 2020. URLhttps://arxiv.org/abs/2004.05150

work page internal anchor Pith review Pith/arXiv arXiv 2020

[2] [2]

Bolya and J

D. Bolya and J. Hoffman. Token merging for fast stable diffusion, 2023. URL https: //arxiv.org/abs/2303.17604

work page arXiv 2023

[3] [3]

Bolya and J

D. Bolya and J. Hoffman. Token merging for fast stable diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4599–4603, 2023

work page 2023

[4] [4]

Token Merging: Your ViT But Faster

D. Bolya, C.-Y . Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman. Token merging: Your vit but faster, 2023. URLhttps://arxiv.org/abs/2210.09461

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

J. Chen, C. Ge, E. Xie, Y . Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation, 2024

work page 2024

[6] [6]

J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y . Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, and Z. Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In International Conference on Learning Representations, 2024

work page 2024

[7] [7]

Cheng, Z

X. Cheng, Z. Chen, and Z. Jia. Cat pruning: Cluster-aware token pruning for text-to-image diffusion models, 2025. URLhttps://arxiv.org/abs/2502.00433

work page arXiv 2025

[8] [8]

Generating Long Sequences with Sparse Transformers

R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers, 2019. URLhttps://arxiv.org/abs/1904.10509

work page internal anchor Pith review Pith/arXiv arXiv 2019

[9] [9]

G. Fang, X. Ma, and X. Wang. Structural pruning for diffusion models.arXiv preprint arXiv:2305.10924, 2023

work page arXiv 2023

[10] [10]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020

[11] [11]

J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans. Cascaded diffusion models for high fidelity image generation, 2021. URLhttps://arxiv.org/abs/2106.15282

work page arXiv 2021

[12] [12]

Vbench: Comprehensive benchmark suite for video generative models, 2023

Z. Huang, Y . He, J. Yu, F. Zhang, C. Si, Y . Jiang, Y . Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y . Wang, X. Chen, L. Wang, D. Lin, Y . Qiao, and Z. Liu. VBench: Comprehensive Benchmark Suite for Video Generative Models, Nov. 2023. URL http://arxiv.org/abs/2311.17982. arXiv:2311.17982 [cs]

work page arXiv 2023

[13] [13]

Jeong, K

W. Jeong, K. Lee, H. Seo, and S. Y . Chun. Training-free mixed-resolution latent upsampling for spatially accelerated diffusion transformers, 2026. URL https://arxiv.org/abs/2507. 08422

work page 2026

[14] [14]

Ryoo, and Tian Xie

K. Kahatapitiya, H. Liu, S. He, D. Liu, M. Jia, C. Zhang, M. S. Ryoo, and T. Xie. Adaptive caching for faster video generation with diffusion transformers, 2024. URL https://arxiv. org/abs/2411.02397

work page arXiv 2024

[15] [15]

M. Kim, S. Gao, Y .-C. Hsu, Y . Shen, and H. Jin. Token fusion: Bridging the gap between token pruning and token merging. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1383–1392, 2024

work page 2024

[16] [16]

S. Kim, H. Lee, W. Cho, M. Park, and W. W. Ro. Ditto: Accelerating diffusion model via temporal value similarity. InProceedings of the 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2025

work page 2025

[17] [17]

H. Li, Y . Yang, M. Chang, H. Feng, Z. Xu, Q. Li, and Y . Chen. Srdiff: Single image super- resolution with diffusion probabilistic models, 2021. URL https://arxiv.org/abs/2104. 14951

work page 2021

[18] X. Li, Y. Liu, L. Lian, H. Yang, Z. Dong, D. Kang, S. Zhang, and K. Keutzer. Q-diffusion: Quantizing diffusion models. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 17489–17499, 2023. doi: 10.1109/ICCV51070.2023.01608

[19] F. Liu, S. Zhang, X. Wang, Y. Wei, H. Qiu, Y. Zhao, Y. Zhang, Q. Ye, and F. Wan. Timestep embedding tells: It’s time to cache for video diffusion model, 2025. URL https://arxiv.org/abs/2411.19108

[20] J. Liu, C. Zou, Y. Lyu, J. Chen, and L. Zhang. From reusing to forecasting: Accelerating diffusion models with taylorseers, 2025. URL https://arxiv.org/abs/2503.06923

[21] X. Liu, C. Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023

[22] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022

[23] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022

[24] W. Lu, S. Zheng, Y. Xia, and S. Wang. Toma: Token merge with attention for diffusion models. URL https://arxiv.org/abs/2509.10918

[26] Z. Luo, D. Chen, Y. Zhang, Y. Huang, L. Wang, Y. Shen, D. Zhao, J. Zhou, and T. Tan. Videofusion: Decomposed diffusion models for high-quality video generation, 2023. URL https://arxiv.org/abs/2303.08320

[27] D. Menn, Y. Yang, B. Wang, X. Wei, M. Munir, F. Liang, R. Marculescu, C. Xu, and D. Marculescu. Video compression meets video generation: Latent inter-frame pruning with attention recovery, 2026. URL https://arxiv.org/abs/2603.05811

[28] W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

[29] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015

[30] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi. Image super-resolution via iterative refinement, 2021. URL https://arxiv.org/abs/2104.07636

[31] T. Salimans and J. Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

[32] Y. Shang, Z. Yuan, B. Xie, B. Wu, and Y. Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1972–1981, 2023

[33] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015

[34] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021

[35] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211–32252. PMLR, 2023

[36] X. Sun et al. Hunyuanvideo: A systematic framework for large video generation models, 2024. URL https://arxiv.org/abs/2412.03603

[37] Y. Tian, X. Xia, Y. Ren, S. Lin, X. Wang, X. Xiao, Y. Tong, L. Yang, and B. Cui. Training-free diffusion acceleration with bottleneck sampling, 2025. URL https://arxiv.org/abs/2503.18940

[38] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....

[39] B. Wu, C. Zou, C. Li, D. Huang, F. Yang, H. Tan, J. Peng, J. Wu, J. Xiong, J. Jiang, Linus, Patrol, P. Zhang, P. Chen, P. Zhao, Q. Tian, S. Liu, W. Kong, W. Wang, X. He, X. Li, X. Deng, X. Zhe, Y. Li, Y. Long, Y. Peng, Y. Wu, Y. Liu, Z. Wang, Z. Dai, B. Peng, C. Li, G. Gong, G. Xiao, J. Tian, J. Lin, J. Liu, J. Zhang, J. Lian, K. Pan, L. Wang, L. Niu...

[40] Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, D. Yin, X. Gu, Yuxuan.Zhang, W. Wang, Y. Cheng, B. Xu, Y. Dong, and J. Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=LQzN6TRFg9

[41] Z. Yuan, R. Xie, Y. Shang, H. Zhang, S. Wang, S. Yan, G. Dai, and Y. Wang. Vgdfr: Diffusion-based video generation with dynamic latent frame rate, 2025. URL https://arxiv.org/abs/2504.12259

[42] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big bird: Transformers for longer sequences, 2021. URL https://arxiv.org/abs/2007.14062

[43] E. Zhang, B. Xiao, J. Tang, Q. Ma, C. Zou, X. Ning, X. Hu, and L. Zhang. Token pruning for caching better: 9 times acceleration on stable diffusion for free, 2024. URL https://arxiv.org/abs/2501.00375

[44] E. Zhang, J. Tang, X. Ning, and L. Zhang. Training-free and hardware-friendly acceleration for diffusion models via similarity-based token pruning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

[45] J. Zhang, C. Xiang, H. Huang, J. Wei, H. Xi, J. Zhu, and J. Chen. Spargeattention: Accurate and training-free sparse attention accelerating any model inference, 2025. URL https://arxiv.org/abs/2502.18137

[46] Y. Zhang, J. Xing, B. Xia, S. Liu, B. Peng, X. Tao, P. Wan, E. Lo, and J. Jia. Training-free efficient video generation via dynamic token carving, 2025. URL https://arxiv.org/abs/2505.16864

[47] K. Zheng, C. Lu, J. Chen, and J. Zhu. DPM-solver-v3: Improved diffusion ODE solver with empirical model statistics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=9fWKExmKa0

[48] K. Zheng, Y. Wang, Q. Ma, H. Chen, J. Zhang, Y. Balaji, J. Chen, M.-Y. Liu, J. Zhu, and Q. Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency. URL https://arxiv.org/abs/2510.08431

[50] S. Zheng, G. Chen, Q. Zhou, Y. Lin, L. He, C. Zou, P. Cai, J. Liu, and L. Zhang. Let features decide their own solvers: Hybrid feature caching for diffusion transformers, 2025. URL https://arxiv.org/abs/2510.04188

[51] S. Zheng, L. Feng, X. Wang, Q. Zhou, P. Cai, C. Zou, J. Liu, Y. Lin, J. Chen, Y. Ma, and L. Zhang. Forecast then calibrate: Feature caching as ode for efficient diffusion transformers. URL https://arxiv.org/abs/2508.16211

[53] S. Zheng, G. Chen, L. He, J. Liu, Y. Lin, C. Zou, and L. Zhang. From sketch to fresco: Efficient diffusion transformer with progressive resolution, 2026. URL https://arxiv.org/abs/2601.07462

[54] Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y. Zhou, T. Li, and Y. You. Open-sora: Democratizing efficient video production for all, March 2024. URL https://github.com/hpcaitech/Open-Sora

[55] H. Zhu, D. Tang, J. Liu, M. Lu, J. Zheng, J. Peng, D. Li, Y. Wang, F. Jiang, L. Tian, S. Tiwari, A. Sirasao, J.-H. Yong, B. Wang, and E. Barsoum. Dip-go: A diffusion pruner via few-step gradient optimization, 2024

[56] C. Zou, C. Li, Y. Li, P. Li, J. Wu, X. He, S. Liu, Z. Zhong, K. Huang, and L. Zhang. Disca: Accelerating video diffusion transformers with distillation-compatible learnable feature caching. URL https://arxiv.org/abs/2602.05449