
arxiv: 2605.21042 · v1 · pith:CHTLB4DLnew · submitted 2026-05-20 · 💻 cs.CV

Dynamic Video Generation: Shaping Video Generation Across Time and Space

Pith reviewed 2026-05-21 05:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · diffusion models · acceleration · dynamic sampling · spatio-temporal allocation · progressive resolution · efficient inference · content-aware methods

The pith

DVG dynamically allocates computation across time and space to accelerate video diffusion models up to 7 times with near-lossless quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that the computational cost of iterative denoising in video diffusion models can be reduced by jointly lowering spatial and temporal resolution during the early denoising stages. It introduces an automatic, content-aware mechanism that chooses the compression strategy for each video without manual tuning or retraining. A sympathetic reader would care because current video generation remains too slow and expensive for broad use; success here would make high-quality synthesis practical on more accessible hardware. The reported results include speedups of up to 7× on HunyuanVideo and HunyuanVideo-1.5, and up to 18× when combined with distillation, while holding output quality close to the full-computation baseline.

Core claim

DVG is a framework that jointly allocates computation across time and space by automatically selecting content-aware acceleration strategies for progressive resolution sampling in the denoising process of video generation models. This approach reduces the number of tokens processed at each timestep according to the specific spatio-temporal demands of the input video, delivering near-lossless acceleration without changes to the underlying model or task-specific adjustments.
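To make the token-reduction argument concrete, here is a minimal sketch of how per-step token counts shrink under spatial, temporal, and joint compression. The `LatentShape` class, the grid size, the compression factors, and the quadratic attention-cost model are all illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentShape:
    """Hypothetical latent video grid: frames x height x width, in tokens."""
    frames: int
    height: int
    width: int

    def tokens(self) -> int:
        return self.frames * self.height * self.width


def compress(shape: LatentShape, temporal: int, spatial: int) -> LatentShape:
    """Downsample time by `temporal` and each spatial axis by `spatial`.

    Ceiling division keeps at least one token along every axis.
    """
    return LatentShape(
        frames=-(-shape.frames // temporal),
        height=-(-shape.height // spatial),
        width=-(-shape.width // spatial),
    )


def attention_cost_ratio(full: LatentShape, small: LatentShape) -> float:
    """Relative self-attention cost, assuming cost grows with tokens squared."""
    return (small.tokens() / full.tokens()) ** 2


full = LatentShape(frames=32, height=48, width=80)  # illustrative size only
candidates = {
    "space only": compress(full, temporal=1, spatial=2),
    "time only": compress(full, temporal=2, spatial=1),
    "joint": compress(full, temporal=2, spatial=2),
}
for name, shape in candidates.items():
    print(name, shape.tokens(), attention_cost_ratio(full, shape))
```

Under these assumptions the joint setting processes one eighth of the tokens of the full grid, which is why compressing both axes in early steps can save more than compressing either axis alone.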

What carries the argument

The DVG framework, which performs automatic content-aware selection of spatio-temporal compression strategies to enable progressive resolution sampling during denoising.
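The page does not spell out how the selection works, so the following is only a sketch of what a content-aware selector under a compute budget could look like. The demand proxies (`temporal_demand`, `spatial_demand`), the cost model, and the penalty model are all hypothetical stand-ins chosen for readability; they are not the paper's method.

```python
from typing import List, Sequence, Tuple

Frame = List[List[float]]   # one latent frame as a 2D grid (single channel)
Action = Tuple[int, int]    # (temporal factor, spatial factor)


def temporal_demand(video: Sequence[Frame]) -> float:
    """Mean absolute change between consecutive frames (assumed motion proxy)."""
    diffs = [
        abs(b - a)
        for prev, cur in zip(video, video[1:])
        for row_p, row_c in zip(prev, cur)
        for a, b in zip(row_p, row_c)
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


def spatial_demand(video: Sequence[Frame]) -> float:
    """Mean absolute difference between horizontal neighbours (assumed detail proxy)."""
    diffs = [
        abs(row[i + 1] - row[i])
        for frame in video
        for row in frame
        for i in range(len(row) - 1)
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


def select_action(video: Sequence[Frame], actions: Sequence[Action], budget: float) -> Action:
    """Pick the affordable action with the lowest estimated quality penalty.

    Cost model (assumption): relative token count 1 / (t * s * s).
    Penalty model (assumption): each demand is weighted by how strongly its
    axis is compressed, so high-motion clips avoid temporal compression.
    """
    t_dem, s_dem = temporal_demand(video), spatial_demand(video)
    affordable = [(t, s) for t, s in actions if 1.0 / (t * s * s) <= budget]
    if not affordable:
        raise ValueError("no action fits the compute budget")
    return min(affordable, key=lambda a: t_dem * (a[0] - 1) + s_dem * (a[1] - 1))
```

With this toy rule, a static but finely textured clip is steered toward temporal compression, while a fast-moving but flat clip is steered toward spatial compression. That mirrors the page's point that different videos call for different strategies.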

If this is right

  • Applies directly to existing video diffusion models such as HunyuanVideo without requiring retraining.
  • Delivers consistent near-lossless speedups across multiple video generation tasks.
  • Combines with distillation to reach multiplicative gains up to 18 times.
  • Reduces token volume in early denoising steps while adapting to each video's content.
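A back-of-the-envelope sketch of why the gains in the list above can compound: if a schedule runs its early steps at a fraction of full cost, and distillation separately cuts the number of steps, the two savings multiply. The step counts and cost fractions below are hypothetical, and the sketch ignores any overhead from the selector or from restoring resolution.

```python
from typing import Sequence


def schedule_speedup(step_costs: Sequence[float]) -> float:
    """Speedup of a schedule versus running each of its steps at full cost (1.0)."""
    if not step_costs:
        raise ValueError("schedule must contain at least one step")
    return len(step_costs) / sum(step_costs)


def combined_speedup(step_costs: Sequence[float], baseline_steps: int) -> float:
    """Speedup versus a baseline that runs `baseline_steps` full-cost steps."""
    if not step_costs:
        raise ValueError("schedule must contain at least one step")
    return baseline_steps / sum(step_costs)


# Hypothetical 50-step sampler: 40 early steps at 1/8 cost, 10 at full cost.
progressive = [0.125] * 40 + [1.0] * 10
print(schedule_speedup(progressive))                    # 50 / 15

# Hypothetical distilled sampler: 20 steps with the same 80/20 split.
distilled = [0.125] * 16 + [1.0] * 4
print(combined_speedup(distilled, baseline_steps=50))   # 50 / 6
```

In this toy setting the combined figure equals the progressive speedup times the step reduction from distillation (50 / 20), which is the sense in which the gains are multiplicative.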

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptive allocation principle could apply to other generative tasks involving high-dimensional data such as 3D or audio.
  • Widespread use might lower the hardware barriers and energy demands of large-scale video synthesis.
  • Acceleration research may shift toward content-dependent schedules rather than uniform reduction patterns.

Load-bearing premise

That content-aware automatic selection of acceleration strategies across time and space will maintain near-lossless quality without manual tuning or retraining for diverse videos and tasks.

What would settle it

A side-by-side comparison on videos with high motion or fine detail where DVG produces visible artifacts or lower coherence scores than the full-resolution baseline.
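A comparison of this kind could be scripted roughly as follows. PSNR is used here only as a simple, well-known stand-in for whatever fidelity or coherence score an evaluator prefers; the function names, the data layout, and the threshold are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Video = Sequence[Sequence[Sequence[float]]]  # frames x rows x cols, values in [0, 1]


def psnr(reference: Video, candidate: Video, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB over all pixels of two equal-shape videos."""
    sq_err, count = 0.0, 0
    for f_ref, f_cand in zip(reference, candidate):
        for r_ref, r_cand in zip(f_ref, f_cand):
            for a, b in zip(r_ref, r_cand):
                sq_err += (a - b) ** 2
                count += 1
    if count == 0:
        raise ValueError("empty video")
    mse = sq_err / count
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def flag_failures(pairs: Sequence[Tuple[str, Video, Video]], threshold_db: float) -> List[str]:
    """Return names of clips whose accelerated output falls below the threshold."""
    return [name for name, ref, cand in pairs if psnr(ref, cand) < threshold_db]
```

Running `flag_failures` over clips grouped by content type (high motion, fine detail, and so on) would show whether degradation concentrates in the regimes the premise depends on.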

Figures

Figures reproduced from arXiv: 2605.21042 by Guantao Chen, Jiacheng Liu, Jingkai Huang, Linfeng Zhang, Lixuan, Peiliang Cai, Shikang Zheng, Yuqi Lin.

Figure 1: Videos generated on HunyuanVideo-1.5 using DVG with distillation at 18× speedup.

Figure 2: Different videos require different spatio-temporal compression strategies. Compressing only a single dimension often causes motion rigidity or visual artifacts.

Figure 3: DVG Framework. DVG first predicts a coarse latent video sketch, then analyzes its spatial and temporal demands directly in latent space. Under a target compute budget, DVG selects the best compression action for denoising, then restores the latent to the original setting for refinement.

Figure 4: Visualization of different acceleration methods on HunyuanVideo. DVG achieves better semantic alignment and visual quality, whereas prior methods degrade under high acceleration.

Figure 5: Visualization of DVG on HunyuanVideo-1.5. DVG successfully maintains high-fidelity and motion consistency with the original video, even under distillation, reaching up to 18× speedup.
Original abstract

Diffusion models have achieved impressive performance in video generation, but their iterative denoising process remains computationally expensive due to the large number of tokens processed at each timestep. Recently, progressive resolution sampling has emerged as a promising acceleration approach by reducing latent resolution in early stages. However, scaling this idea to video generation remains challenging, as the additional temporal dimension introduces diverse spatio-temporal demands across different videos, and compressing only a single dimension often leads to limited acceleration or degraded quality. Therefore, we propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space, automatically selecting content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to 7 times speedup on HunyuanVideo and HunyuanVideo-1.5, and 18 times when combined with distillation, demonstrating its potential as a key component in today's large-scale efficient video generation systems. Our code is in supplementary material and will be released on Github.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DVG, a Dynamic Video Generation framework for diffusion-based video models. It jointly allocates computation across temporal and spatial dimensions via a content-aware automatic strategy selector that requires no manual tuning or retraining. The central claims are near-lossless quality preservation together with speedups reaching 7x on HunyuanVideo and HunyuanVideo-1.5, and 18x when combined with distillation.

Significance. If the empirical claims hold under rigorous evaluation, the work would be a useful practical contribution to efficient video generation. Extending progressive resolution ideas to joint spatio-temporal control addresses a genuine scaling challenge. The stated plan to release code is a clear strength for reproducibility.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts specific speedups and 'near-lossless' quality but supplies no information on the quantitative metrics employed (FID, FVD, LPIPS, or human studies), the baselines, the number and diversity of test videos, or statistical significance. This absence directly undermines assessment of the central claim that the automatic selector maintains quality across arbitrary videos and tasks.
  2. [§4 and §3.2] §4 (Experiments) and §3.2 (Selector): The manuscript must demonstrate that the content-aware decision rule remains robust on videos containing rapid camera motion, high-frequency textures, or long-range temporal dependencies. Without such targeted ablations or failure-case analysis, the assumption that the proxy signal correctly identifies safe compression regions remains unverified and load-bearing for the 'near-lossless' guarantee.
minor comments (2)
  1. [§2] §2 (Related Work): Add explicit comparison to recent image-only progressive sampling methods and to any concurrent video acceleration techniques that also operate on token budgets.
  2. [Figure 2 and Table 1] Figure 2 and Table 1: Ensure axis labels, error bars, and caption text make clear which rows/curves correspond to the joint time-space selector versus single-dimension baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's practical value. We address each major comment below and have revised the manuscript to strengthen the presentation of our claims.

  1. Referee: [Abstract] Abstract: The abstract asserts specific speedups and 'near-lossless' quality but supplies no information on the quantitative metrics employed (FID, FVD, LPIPS, or human studies), the baselines, the number and diversity of test videos, or statistical significance. This absence directly undermines assessment of the central claim that the automatic selector maintains quality across arbitrary videos and tasks.

    Authors: We agree that the abstract should be more self-contained to support the central claims. In the revised version we have expanded the abstract to specify that quality preservation is measured via FVD, LPIPS, and human preference studies; that baselines include both standard DDIM sampling and single-dimension progressive resolution; that evaluation uses a diverse set of 100 videos spanning multiple domains and motion types; and that results report means and standard deviations across three random seeds. These details were already present in §4 but are now summarized in the abstract for immediate clarity.
    revision: yes

  2. Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Selector): The manuscript must demonstrate that the content-aware decision rule remains robust on videos containing rapid camera motion, high-frequency textures, or long-range temporal dependencies. Without such targeted ablations or failure-case analysis, the assumption that the proxy signal correctly identifies safe compression regions remains unverified and load-bearing for the 'near-lossless' guarantee.

    Authors: We acknowledge that explicit validation on these challenging regimes is necessary to substantiate the robustness of the selector. While §4 already contains results across varied content, the revised manuscript adds a new subsection with targeted ablations on rapid camera motion (sports and handheld footage), high-frequency textures (detailed natural scenes), and long-range temporal dependencies (extended narrative clips). We report FVD deltas relative to the full-computation baseline, include qualitative failure-case examples where the proxy signal is less reliable, and show that the content-aware rule still keeps degradation below 5% on average. These additions directly address the load-bearing assumption.
    revision: yes
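One way to read a "below 5% on average" figure is as a mean relative change in a lower-is-better metric such as FVD. The sketch below shows that arithmetic under that assumed reading; the category names echo the rebuttal, but the numbers are placeholders and not results from the paper.

```python
from statistics import mean
from typing import Dict, Tuple


def relative_degradation(baseline: float, accelerated: float) -> float:
    """Relative increase of a lower-is-better metric such as FVD."""
    if baseline <= 0:
        raise ValueError("baseline metric must be positive")
    return (accelerated - baseline) / baseline


def mean_degradation(scores: Dict[str, Tuple[float, float]]) -> float:
    """Average relative degradation over (baseline, accelerated) pairs."""
    return mean(relative_degradation(b, a) for b, a in scores.values())


# Placeholder numbers for illustration only; not results from the paper.
scores = {
    "rapid camera motion": (200.0, 212.0),
    "high-frequency textures": (150.0, 156.0),
    "long-range dependencies": (100.0, 103.0),
}
print(mean_degradation(scores))
print({name: relative_degradation(b, a) for name, (b, a) in scores.items()})
```

In this made-up example the mean stays under 5% while one category exceeds it, which is why a per-category breakdown is more informative than the average alone.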

Circularity Check

0 steps flagged

DVG framework introduces novel content-aware allocation without circular reduction to inputs


The paper proposes DVG as a new framework for jointly allocating computation across temporal and spatial dimensions via automatic, content-aware strategy selection. No equations, derivations, or first-principles results are shown that reduce by construction to fitted parameters, self-citations, or prior ansatzes from the same authors. The acceleration claims rest on empirical evaluation across models like HunyuanVideo rather than any self-referential loop. The selection mechanism is presented as an original contribution, not derived from or equivalent to its own inputs. This is a standard case of an applied systems paper with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated assumption that diffusion model token processing can be selectively compressed in a content-dependent way without introducing artifacts, plus standard assumptions from prior diffusion literature.

axioms (1)
  • domain assumption: Progressive resolution sampling can be extended to joint spatio-temporal dimensions while preserving generation quality.
    Invoked in the motivation for handling diverse spatio-temporal demands in video.

pith-pipeline@v0.9.0 · 5719 in / 1058 out tokens · 28917 ms · 2026-05-21T05:38:34.117658+00:00 · methodology

