arxiv: 2605.23381 · v1 · pith:KREE7F5Snew · submitted 2026-05-22 · 💻 cs.CV

VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation

Pith reviewed 2026-05-25 04:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords rectified flow models · training-free acceleration · velocity decomposition · inference optimization · image generation · video generation · generative models · denoising acceleration

The pith

VDE accelerates rectified flow models without training by decomposing the velocity into components parallel and orthogonal to the input and estimating those components instead of running the full model at every step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Velocity Decomposition and Estimation (VDE) to speed up inference in rectified flow models for image and video generation. Prior methods cache and reuse features from earlier steps, but a static cache drifts out of step with the input as it evolves during generation. VDE instead splits the velocity output into a part aligned with the current input and a part perpendicular to it, and uses the observed temporal predictability of these parts and the stability of their directions to estimate future values adaptively. Full model passes are inserted periodically to reset accumulated errors. In the reported experiments, VDE accelerates Flux by 3.22× and reaches an LPIPS of 0.069 on Qwen-Image, a 52.2% reduction relative to the best baseline.

Core claim

VDE decomposes the model's velocity into components parallel and orthogonal to the input, exploits their temporal predictability and directional stability for precise input-adaptive estimation without training, and inserts periodic full forward passes to anchor the state and prevent error accumulation, enabling faster inference in rectified flow models while preserving output fidelity.
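
The decomposition named in the core claim can be written out explicitly. What follows is a sketch using the standard orthogonal projection and the symbols α_t, β_t, and u_t that appear in the Figure 3 caption; whether the paper normalizes the parallel coefficient in exactly this way is an assumption, since only the abstract and captions are available here.

```latex
\begin{aligned}
v_t &= \underbrace{\alpha_t\, x_t}_{v_\parallel} + \underbrace{\beta_t\, u_t}_{v_\perp},
\qquad \langle u_t, x_t \rangle = 0, \quad \lVert u_t \rVert = 1, \\
\alpha_t &= \frac{\langle v_t, x_t \rangle}{\lVert x_t \rVert^{2}}, \qquad
\beta_t = \lVert v_t - \alpha_t x_t \rVert, \qquad
u_t = \frac{v_t - \alpha_t x_t}{\beta_t}.
\end{aligned}
```

Under this form, estimating the velocity at a skipped step reduces to predicting two scalars and one unit direction, then recombining them with the current input x_t.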

What carries the argument

Decomposition of velocity into input-parallel and input-orthogonal components, estimated from their temporal predictability and directional stability with periodic anchoring via full passes.
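
A minimal NumPy sketch of that mechanism is given here, assuming the projection form of the decomposition and a simple two-point linear extrapolation of the scalar coefficients. The function names `decompose`, `extrapolate`, and `estimate_velocity` are hypothetical, and the paper's actual estimator may differ.

```python
import numpy as np

def decompose(v, x, eps=1e-12):
    """Split v into alpha * x + beta * u, with u a unit vector orthogonal to x."""
    v = np.asarray(v, dtype=float).ravel()   # latents are flattened to vectors
    x = np.asarray(x, dtype=float).ravel()
    alpha = float(v @ x) / (float(x @ x) + eps)
    v_perp = v - alpha * x
    beta = float(np.linalg.norm(v_perp))
    u = v_perp / (beta + eps)
    return alpha, beta, u

def extrapolate(prev, curr):
    """Linear extrapolation one step ahead from two equally spaced samples."""
    return 2.0 * curr - prev

def estimate_velocity(x_next, hist):
    """Estimate the velocity at x_next from the two latest full-pass decompositions.

    Assumption: the scalars are extrapolated linearly and the most recent
    orthogonal direction is reused unchanged.
    """
    (a0, b0, _), (a1, b1, u1) = hist[-2], hist[-1]
    x_next = np.asarray(x_next, dtype=float).ravel()
    return extrapolate(a0, a1) * x_next + extrapolate(b0, b1) * u1
```

The estimate is input-adaptive in the sense that the parallel part is rebuilt from the current input `x_next` rather than copied from a cache.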

If this is right

  • Rectified flow models achieve multi-fold inference speedup on image and video tasks with only minor quality degradation measured by LPIPS.
  • No model retraining or fine-tuning is required to apply the acceleration.
  • The approach outperforms feature-caching baselines by reducing the mismatch between cached states and evolving inputs.
  • Periodic full passes limit error growth, allowing longer acceleration sequences without collapse in fidelity.
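
The interplay between estimated steps and periodic full passes can be sketched with a toy Euler sampler. Everything here is an illustrative assumption: `toy_velocity` is a stand-in smooth field rather than a trained rectified flow model, and the `warmup` and `anchor_every` schedule is hypothetical rather than taken from the paper.

```python
import numpy as np

def toy_velocity(x, t):
    """Hypothetical stand-in for one expensive model call (smooth, time-varying field)."""
    return np.tanh(x[::-1]) - (1.0 - t) * x

def decompose(v, x, eps=1e-12):
    """Return (alpha, beta, u) with v ~= alpha * x + beta * u and u orthogonal to x."""
    alpha = float(v @ x) / (float(x @ x) + eps)
    v_perp = v - alpha * x
    beta = float(np.linalg.norm(v_perp))
    return alpha, beta, v_perp / (beta + eps)

def sample(x0, num_steps=50, warmup=4, anchor_every=3):
    """Euler sampler that calls the model only during warm-up and on anchor steps."""
    assert warmup >= 2, "two full passes are needed before extrapolating"
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    hist = []        # (step index, alpha, beta, u) recorded at full passes
    full_passes = 0
    for i in range(num_steps):
        if i < warmup or (i - warmup) % anchor_every == 0:
            v = toy_velocity(x, i * dt)          # full forward pass (anchor)
            full_passes += 1
            hist.append((i, *decompose(v, x)))
        else:
            (i0, a0, b0, _), (i1, a1, b1, u1) = hist[-2], hist[-1]
            s = (i - i1) / (i1 - i0)             # extrapolation distance in step units
            alpha_hat = a1 + s * (a1 - a0)
            beta_hat = b1 + s * (b1 - b0)
            v = alpha_hat * x + beta_hat * u1    # estimated velocity, no model call
        x = x + dt * v
    return x, full_passes

x_fast, n_fast = sample(np.array([1.0, -0.5, 0.3, 2.0]))
print(n_fast)  # number of full passes out of 50 steps
```

With these defaults the sampler makes 20 full passes out of 50 steps, a nominal 2.5× reduction in model calls that ignores the cost of the estimation itself. Setting `anchor_every=1` recovers plain Euler sampling, which gives a reference trajectory for measuring the error introduced by the estimated steps.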

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition principle could be tested on non-rectified diffusion models to check if parallel-orthogonal stability generalizes.
  • Measuring how the angle between velocity and input evolves across steps might yield a diagnostic for when estimation remains reliable.
  • Applying VDE to conditional generation with text or layout inputs could reveal whether the stability holds under stronger guidance.
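
The angle diagnostic suggested above could be prototyped as follows. This is a sketch only: `angle_trace` and `max_jump` are hypothetical helper names, and the inputs are assumed to be per-step inputs and velocities recorded from a full (unaccelerated) sampling run.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two arrays, flattened to vectors."""
    a = np.ravel(np.asarray(a, dtype=float))
    b = np.ravel(np.asarray(b, dtype=float))
    return float(a @ b) / (float(np.linalg.norm(a) * np.linalg.norm(b)) + eps)

def angle_trace(xs, vs):
    """Angle in degrees between velocity and input at every recorded step."""
    return [
        float(np.degrees(np.arccos(np.clip(cosine(v, x), -1.0, 1.0))))
        for x, v in zip(xs, vs)
    ]

def max_jump(angles):
    """Largest change in angle between consecutive steps."""
    return float(np.max(np.abs(np.diff(angles)))) if len(angles) > 1 else 0.0
```

A small `max_jump` over a window of steps would suggest that the geometry is changing slowly there; whether that correlates with estimation accuracy is the open question the bullet raises.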

Load-bearing premise

The velocity components parallel and orthogonal to the input possess temporal predictability and directional stability that permit accurate estimation from prior steps without training.

What would settle it

Compare the true velocity vectors computed by the full model against VDE's estimated vectors at multiple denoising steps on the same input sequence; large consistent deviations would cause visible quality loss.
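
One way to run that comparison is to log both velocities at every step and compute a per-step relative L2 error and cosine similarity. The helper below is a sketch with a hypothetical name, `velocity_errors`; it assumes the velocities have been flattened into arrays of shape (steps, dim).

```python
import numpy as np

def velocity_errors(v_true, v_est, eps=1e-12):
    """Per-step relative L2 error and cosine similarity for (steps, dim) arrays."""
    v_true = np.asarray(v_true, dtype=float)
    v_est = np.asarray(v_est, dtype=float)
    norm_true = np.linalg.norm(v_true, axis=1)
    norm_est = np.linalg.norm(v_est, axis=1)
    rel_l2 = np.linalg.norm(v_true - v_est, axis=1) / (norm_true + eps)
    cos = np.sum(v_true * v_est, axis=1) / (norm_true * norm_est + eps)
    return rel_l2, cos

rel_l2, cos = velocity_errors([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]])
print(rel_l2, cos)
```

In the second row of this example the estimate has the right direction but half the magnitude, so the cosine similarity is about 1 while the relative L2 error is 0.5. Reporting both separates direction errors from magnitude errors.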

Figures

Figures reproduced from arXiv: 2605.23381 by Hongyuan Chen, Jinglin Liang, Junwen Tan, Shuangping Huang.

Figure 1: Qualitative comparison between VDE and standard 50-step sampling across Flux, Qwen-Image, and Wan2.1. VDE achieves …

Figure 2: Comparison between standard feature caching and our VDE.

Figure 3: Temporal dynamics of velocity components. Evolution of the decomposed velocity components across Flux, Qwen-Image, and Wan2.1. Top: The parallel (α_t) and orthogonal (β_t) coefficients. Bottom: The cosine similarity of adjacent orthogonal directions (u_t). After an initial warm-up phase (shaded), the scalar coefficients evolve smoothly with strong local linearity, while the orthogonal direction remains highly…

Figure 4: Quantitative evidence for the temporal regularities. We …

Figure 5: Qualitative comparison of caching-based methods and our VDE on Flux. VDE preserves both global structure and fine details.

Figure 6: Qualitative comparison of caching-based methods and our VDE on Flux, Qwen-Image, and Wan2.1. VDE successfully acceler…
Original abstract

Though rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployments are challenged by slow inference speeds. Prior acceleration methods reuse cached features from previous steps, which neglects the growing mismatch between static caches and the evolving input, leading to reduced output fidelity. This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating. Specifically, VDE decomposes the model's velocity into components parallel and orthogonal to the input, exploiting their temporal predictability and directional stability for precise, input-adaptive estimation. To prevent error accumulation, it periodically anchors the model's state via full forward passes. Extensive experiments on image and video generation tasks demonstrate that VDE achieves substantial acceleration with minimal loss in visual quality. Notably, VDE accelerates Flux by 3.22 times and achieves an LPIPS of 0.069 on Qwen-Image, outperforming the best baseline with a 52.2% reduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration technique for rectified flow models. It decomposes the velocity into components parallel and orthogonal to the input x, estimates future values from the claimed temporal predictability and directional stability of those components, and periodically performs full forward passes to reset state and limit error accumulation. Experiments on image and video tasks report a 3.22× speedup on Flux and an LPIPS of 0.069 on Qwen-Image, which the abstract describes as a 52.2% reduction relative to the best baseline.

Significance. If the predictability and stability properties of the decomposed velocity components can be rigorously established, the method would offer a practical training-free route to faster inference in large rectified-flow models while preserving output fidelity, which is valuable for deployment in image and video generation pipelines.

major comments (2)
  1. [Abstract / method description] The central claim that the parallel and orthogonal velocity components possess 'temporal predictability and directional stability' permitting accurate input-adaptive estimation is asserted without derivation, bound, or analysis showing why these properties hold across denoising steps or different rectified-flow models. This assumption is load-bearing for the training-free estimation step and error-control argument.
  2. [Experiments] No ablation studies, error analysis, or protocol details are supplied to isolate whether the reported 3.22× speedup on Flux and LPIPS=0.069 on Qwen-Image arise from the decomposition/estimation or from the periodic full passes and other implementation choices. This prevents verification that the quantitative gains support the core claim.
minor comments (2)
  1. [Method] Notation for the parallel/orthogonal decomposition (v_∥ and v_⊥) should be introduced with an explicit equation early in the method section for clarity.
  2. [Abstract / Experiments] The abstract mentions 'extensive experiments on image and video generation tasks' but does not list the exact models, datasets, or metrics used beyond the two highlighted examples; a table summarizing all evaluated settings would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the presentation without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract / method description] The central claim that the parallel and orthogonal velocity components possess 'temporal predictability and directional stability' permitting accurate input-adaptive estimation is asserted without derivation, bound, or analysis showing why these properties hold across denoising steps or different rectified-flow models. This assumption is load-bearing for the training-free estimation step and error-control argument.

    Authors: We agree that the manuscript motivates the properties of temporal predictability and directional stability from the structure of rectified flows but does not supply a formal derivation or bounds. The decomposition follows directly from projecting the velocity onto the input direction and its orthogonal complement, and the stability arises because the parallel component tracks the data manifold while the orthogonal tracks residual noise. In revision we will add a short analysis subsection with (i) a Lipschitz-based argument on why the components evolve more slowly than the full velocity and (ii) empirical plots of component-wise change across steps and models to support the estimation procedure.

    revision: yes

  2. Referee: [Experiments] No ablation studies, error analysis, or protocol details are supplied to isolate whether the reported 3.22× speedup on Flux and LPIPS=0.069 on Qwen-Image arise from the decomposition/estimation or from the periodic full passes and other implementation choices. This prevents verification that the quantitative gains support the core claim.

    Authors: We concur that the current experiments do not fully isolate the contribution of the decomposition-and-estimation step from the periodic anchoring. In the revised version we will add (i) an ablation that disables the adaptive estimation while retaining anchoring, (ii) per-step error curves (L2 velocity error and LPIPS) with and without anchoring, and (iii) explicit protocol details on anchoring frequency, estimation horizon, and implementation choices. These additions will allow readers to attribute the reported speed-ups and fidelity metrics to the proposed VDE components.

    revision: yes

Circularity Check

0 steps flagged

No circularity: derivation rests on stated velocity properties without reduction to fitted inputs or self-citations.

The paper's core step decomposes velocity into parallel/orthogonal components to the input and estimates future values from claimed temporal predictability and directional stability. This is presented as an exploitation of inherent rectified-flow properties rather than any fitted parameter, self-referential prediction, or load-bearing self-citation. No equations reduce the estimation to the inputs by construction, and the training-free framing avoids any internal fitting loop. The provided text contains no self-citations or ansatz smuggling that would trigger the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract alone; the central claim rests on the unverified assumption that parallel and orthogonal velocity components exhibit the required predictability and stability, but no explicit free parameters, axioms, or invented entities are enumerated in the provided text.

pith-pipeline@v0.9.0 · 5716 in / 1110 out tokens · 22584 ms · 2026-05-25T04:55:34.007471+00:00 · methodology
