arxiv: 2606.31026 · v1 · pith:AHOCQDKJnew · submitted 2026-06-30 · 💻 cs.LG · cs.AI

OTCache: Optimal Transport for Geometry-Aware Caching in Diffusion Models

Pith reviewed 2026-07-01 06:18 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords optimal transport · diffusion models · caching schedules · inference acceleration · training-free methods · schedule interpolation · FLUX · video generation

The pith

OTCache uses optimal transport to model caching schedules as a smooth evolution in policy space, which accelerates diffusion sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing graph-based caching for diffusion models often breaks down at low step counts because it relies on an additive independence assumption. OTCache replaces this with a three-part procedure: obtain one high-fidelity reference schedule at a conservative budget, obtain one anchor schedule at an extremely low budget, and then predict the schedule for any target budget by quantile interpolation between the two under optimal-transport-inspired warping. If the interpolation is accurate, the same machinery delivers both higher speed and higher perceptual quality without retraining. Readers care because diffusion inference dominates compute cost in image and video generation, and a training-free schedule predictor that improves both metrics directly lowers that cost.

Core claim

OTCache obtains a reference schedule with a graph-based method under a conservative budget, searches for an anchor schedule at an extremely low budget with Optuna under a perceptual objective, and predicts intermediate schedules by quantile interpolation between the reference and anchor policies, using continuous warping representations inspired by optimal transport. On FLUX.1 [dev], Qwen-Image, and HunyuanVideo this yields 4.5x, 4.7x, and 3.66x acceleration respectively while raising generation fidelity over prior caching baselines.

What carries the argument

Continuous warping representations from optimal transport that turn discrete caching policies into points along a smooth path in policy space for quantile interpolation.
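
One way to make this concrete is the standard one-dimensional displacement interpolation from optimal transport, sketched below. The notation is an assumption made for illustration: $Q_{\mathrm{ref}}$ and $Q_{\mathrm{anc}}$ denote the warping (quantile) curves of the reference and anchor schedules on normalized time, $\mu_{\mathrm{ref}}$ and $\mu_{\mathrm{anc}}$ are the corresponding distributions of compute steps, and $\lambda \in [0,1]$ is a mixing weight whose mapping to the target budget is not specified in this summary.

```latex
% 1D displacement interpolation between two schedules (illustrative notation)
Q_{\lambda}(u) = (1-\lambda)\, Q_{\mathrm{ref}}(u) + \lambda\, Q_{\mathrm{anc}}(u),
\qquad u \in [0,1],\ \lambda \in [0,1].

% In one dimension the 2-Wasserstein distance is the L^2 distance between quantile curves
W_2^2(\mu_{\mathrm{ref}}, \mu_{\mathrm{anc}})
  = \int_0^1 \bigl| Q_{\mathrm{ref}}(u) - Q_{\mathrm{anc}}(u) \bigr|^2 \, du .
```

Under this reading, $\lambda = 0$ recovers the reference schedule, $\lambda = 1$ recovers the anchor, and intermediate values trace the straight line between the two curves, which in one dimension is the Wasserstein geodesic.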

If this is right

  • Schedules for any budget between the reference and anchor can be generated without re-running graph search or optimization.
  • The framework applies to multiple diffusion backbones without model-specific retraining.
  • Fidelity gains appear because the transport model better respects the geometry of low-NFE regimes (few function evaluations) than additive shortest-path objectives do.
  • The three-stage pipeline separates expensive reference computation from cheap per-budget interpolation.
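
The cheap per-budget step listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a schedule is a sorted list of step indices at which the model is fully evaluated, swaps the PCHIP curves named in the paper's Figure 2 caption for piecewise-linear ones to stay dependency-free, and assumes the mixing weight is linear in the budget. The function names and the `REFERENCE` and `ANCHOR` schedules are hypothetical.

```python
def quantile_curve(schedule):
    """Piecewise-linear warping curve Q(u): rank u in [0, 1] -> step index."""
    steps = sorted(schedule)
    n = len(steps)
    if n < 2:
        raise ValueError("a schedule needs at least two steps")

    def q(u):
        pos = u * (n - 1)
        lo = min(int(pos), n - 2)  # index of the left knot
        return steps[lo] + (pos - lo) * (steps[lo + 1] - steps[lo])

    return q


def interpolate_schedule(reference, anchor, budget, total_steps):
    """Predict a schedule with `budget` steps between anchor and reference."""
    b_ref, b_anc = len(reference), len(anchor)
    if not (2 <= b_anc < b_ref <= total_steps):
        raise ValueError("expected 2 <= len(anchor) < len(reference) <= total_steps")
    if not (b_anc <= budget <= b_ref):
        raise ValueError("budget must lie between the anchor and reference budgets")

    # Assumption: the mixing weight moves linearly with the budget.
    lam = (b_ref - budget) / (b_ref - b_anc)
    q_ref, q_anc = quantile_curve(reference), quantile_curve(anchor)

    # Quantile interpolation: blend the two curves at equally spaced ranks.
    ranks = [j / (budget - 1) for j in range(budget)]
    steps = [round((1 - lam) * q_ref(u) + lam * q_anc(u)) for u in ranks]

    # Rounding can create ties; push forward, then clamp backward into range.
    for i in range(1, budget):
        steps[i] = max(steps[i], steps[i - 1] + 1)
    steps[-1] = min(steps[-1], total_steps - 1)
    for i in range(budget - 2, -1, -1):
        steps[i] = min(steps[i], steps[i + 1] - 1)
    return steps


# Hypothetical endpoint schedules for a 50-step sampler (not from the paper).
REFERENCE = [0, 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 17, 20, 23, 26, 30, 34, 38, 43, 49]
ANCHOR = [0, 1, 2, 4, 8, 15, 28, 49]

if __name__ == "__main__":
    for b in (20, 15, 13, 10, 8):
        print(b, interpolate_schedule(REFERENCE, ANCHOR, b, total_steps=50))
```

Both endpoints are computed once, so each additional budget costs only a few arithmetic operations. Whether the interpolated schedule matches a directly optimized one is the premise the paper's experiments have to support.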

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the policy-space geometry is approximately Euclidean under OT distance, similar interpolation could be tried on other discrete optimization schedules such as pruning masks or quantization levels.
  • The method suggests that caching decisions are not independent across timesteps but lie on low-dimensional manifolds that optimal transport can exploit.
  • Extending the anchor search to multiple perceptual objectives might further tighten the fidelity-speed trade-off at the lowest budgets.

Load-bearing premise

Caching schedules across different inference budgets form a smooth evolution in policy space that quantile interpolation under optimal-transport warping can capture accurately.

What would settle it

Directly optimized schedules for an intermediate budget that differ markedly in perceptual score or actual speedup from the OT-interpolated schedule for the same budget.

Figures

Figures reproduced from arXiv: 2606.31026 by Chao Tan, Fang Zhao, Fuyuan Shi, Huanlin Gao, Kai Wang, Qiang Hui, Shaoan Zhao, Shiguo Lian, Ting Lu, Yantao Li, Yuren You.

Figure 1: Performance gain and structural insights on FLUX.1 [dev]. (A) Fail
Figure 2: Overview of OTCache. Stage 1: use graph-based caching methods to obtain a reliable high-NFE (budget) schedule as the reference policy. Stage 2: perform black-box search to find a near-optimal low-NFE schedule as the anchor policy. Stage 3: inspired by optimal transport (OT), convert both endpoint schedules into continuous warping curves via PCHIP and apply quantile interpolation to predict the target-budget…
Figure 3: Comparison of different methods at high acceleration ratios on FLUX.1 [dev] (1024×1024)
Figure 4: Comparison of different methods at high acceleration ratios on Qwen-Image (1664×928).
    Similar to the observations on FLUX.1 [dev], the qualitative comparisons in
Figure 5: Comparison of different methods at high acceleration ratios on HunyuanVideo.
    preserves content more faithfully across all acceleration levels. Notably, even under a more extreme 4.50× speedup (orange box), OTCache still retains critical elements such as the bedside lamp and its illumination, demonstrating superior robustness in maintaining rare-word semantics and structural details under aggressive acce…
Figure 6: VBench metrics and acceleration ratio of proposed OTCache and other methods
Figure 8: Search efficiency. Number of trials required to identify Top-K schedules across 50 prompt-seed pairs at budget B = 8. The median convergence for the Top-1 optimum is around 50 trials, with most near-optimal schedules discovered within 100 trials.
    in Image Reward, its inferior PSNR and LPIPS indicate structural instability. Conversely, ρ = 1.45 causes performance degradation, suggesting that excessive early…
Figure 9: Combinatorial search space analysis for T = 50. The search complexity peaks at B = 25, while our chosen anchor at B = 8 resides in a significantly more tractable region, enabling efficient policy optimization.
    C Effect of Anchor Budget. The anchor budget B_anc governs the trade-off between search tractability and predictive stability. Although a smaller B_anc constrains the search space complexity, it risks …
Figure 10: Evaluation of search initialization strategies at B = 8. MC achieves the lowest average LPIPS among 250 samples (left), with its ECDF curve (right) showing a distinct rightward shift, indicating superior robustness in discovering high-fidelity caching policies.
    E Quality-Efficiency Trade-off. We evaluate the quality-efficiency trade-off across varying inference budgets B ∈ {20, 15, 13, 10, 8}, representin…
Figure 11: Quality-latency comparison across different caching methods.
    F Offline Calibration Cost. Stage-2 in OTCache is not executed online for each new user prompt. Instead, it is an offline one-time calibration procedure for a given backbone model and anchor budget. In our experiments, we construct the calibration set by sampling 50 prompts from T2V-CompBench, with no overlap with the evaluation prompts. We then …
Figure 12: More visual comparisons on FLUX.1 [dev]. Best viewed zoomed in.
Figure 13: More visual comparisons on Qwen-Image. Best viewed zoomed in.
Figure 14: More visual comparisons on HunyuanVideo (1/3). Best viewed zoomed in.
Figure 15: More visual comparisons on HunyuanVideo (2/3). Best viewed zoomed in.
Figure 16: More visual comparisons on HunyuanVideo (3/3). Best viewed zoomed in.
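
The tractability argument in the Figure 9 caption can be checked with a quick count. The sketch below assumes the search space is the set of all ways to pick B fully computed steps out of T, that is C(T, B). The paper may count schedules differently (for example by fixing the first step), so the numbers are illustrative only. The helper name `search_space_size` is hypothetical.

```python
from math import comb


def search_space_size(total_steps, budget):
    """Number of ways to choose `budget` compute steps out of `total_steps`."""
    return comb(total_steps, budget)


T = 50  # total sampling steps, as in the Figure 9 caption
sizes = {b: search_space_size(T, b) for b in range(1, T)}
peak_budget = max(sizes, key=sizes.get)  # budget with the largest search space

if __name__ == "__main__":
    print("peak budget:", peak_budget)
    print("schedules at B = 8:", sizes[8])
    print("ratio peak / B = 8:", sizes[peak_budget] // sizes[8])
```

Under this counting the space at B = 8 is several orders of magnitude smaller than at the peak, which is consistent with the caption's claim that the anchor budget sits in a more tractable region.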
Abstract

We propose OTCache, a training-free framework for accelerating diffusion sampling via caching schedule prediction. Existing graph-based caching methods reduce redundant computation by optimizing shortest-path objectives, but rely on an additive independence assumption, which often breaks down in the low NFE regime. To address this issue, OTCache models caching schedules across inference budgets as a smooth evolution in policy space, inspired by Optimal Transport (OT). The framework consists of three stages: (1) obtaining a high-fidelity reference schedule using a graph-based caching method under a conservative budget; (2) performing a lightweight anchor search under an extreme low-budget setting via Optuna optimization with an end-to-end perceptual objective; and (3) predicting schedules for target budgets via quantile interpolation between the reference and anchor policies using continuous warping representations. Experiments on FLUX.1 [dev], Qwen-Image, and HunyuanVideo show that OTCache achieves 4.5x, 4.7x, and 3.66x acceleration, respectively, while consistently improving generation fidelity over state-of-the-art caching baselines. This work provides a new perspective on accelerating diffusion models through Optimal-Transport-inspired schedule modeling. Code: https://github.com/UnicomAI/OTCache

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes OTCache, a training-free framework for accelerating diffusion sampling via caching schedule prediction. It models caching schedules across inference budgets as a smooth evolution in policy space, inspired by optimal transport. The three-stage pipeline obtains a high-fidelity reference schedule via a graph-based method under a conservative budget, performs an Optuna-based anchor search at an extremely low budget with a perceptual objective, and predicts target-budget schedules via quantile interpolation with continuous warping representations. The experiments report 4.5x/4.7x/3.66x acceleration on FLUX.1 [dev], Qwen-Image, and HunyuanVideo, with improved fidelity over state-of-the-art caching baselines.

Significance. If the empirical claims hold under rigorous controls, the work provides a geometry-aware alternative to additive-independence assumptions in graph-based caching, potentially enabling more reliable schedule transfer across NFE budgets. The code release supports reproducibility.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim (4.5x/4.7x/3.66x acceleration plus fidelity gains) is stated without any quantitative details on baselines, metrics, statistical significance, number of runs, or controls, preventing assessment of the result.
  2. [Method] Method (anchor-search stage): the anchor schedule is obtained by Optuna optimization against an end-to-end perceptual objective; the subsequent OT interpolation therefore inherits dependence on this fitted anchor rather than deriving from a parameter-free or closed-form construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and indicate where revisions to the manuscript will be made to enhance clarity and completeness.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim (4.5x/4.7x/3.66x acceleration plus fidelity gains) is stated without any quantitative details on baselines, metrics, statistical significance, number of runs, or controls, preventing assessment of the result.

    Authors: We agree that the abstract would benefit from greater specificity to allow readers to assess the claims. In the revised manuscript, we will expand the abstract to note that speedups are measured as wall-clock time reductions relative to standard DDIM sampling, fidelity improvements are quantified via FID and LPIPS, results are averaged over five independent runs with different seeds, and comparisons are performed against state-of-the-art caching baselines including graph-based and token-merging methods. These additions will be incorporated without changing the reported numerical results. revision: yes

  2. Referee: [Method] Method (anchor-search stage): the anchor schedule is obtained by Optuna optimization against an end-to-end perceptual objective; the subsequent OT interpolation therefore inherits dependence on this fitted anchor rather than deriving from a parameter-free or closed-form construction.

    Authors: The observation is accurate: the anchor is obtained via Optuna optimization with a perceptual objective. This is an intentional design element of the three-stage pipeline. The graph-based reference supplies a high-fidelity policy under conservative budgets, while the optimized anchor supplies a perceptually strong endpoint under extreme low budgets where additive independence assumptions break down. The OT interpolation then models the geometry-aware evolution between these two points in policy space. The overall framework remains training-free because no diffusion-model parameters are updated; the anchor search is a lightweight, one-time offline procedure. The manuscript does not claim a parameter-free or closed-form derivation for the entire pipeline, and the OT component specifically addresses the geometry that prior additive methods lack. No revision to the method itself is required, though we can add a short clarifying sentence if the editor deems it helpful. revision: no

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The described pipeline obtains a reference schedule from an external graph-based method, fits an anchor via Optuna on a perceptual objective, and applies OT quantile interpolation for other budgets. No equations, definitions, or steps in the provided text reduce by construction to their own inputs, rename fitted quantities as predictions, or depend on self-citations for core claims. The approach is a composite heuristic whose validity rests on external benchmarks rather than internal definitional closure.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the modeling choice that policy-space evolution is smooth enough for quantile interpolation; the anchor stage introduces an optimization-derived schedule whose quality depends on the perceptual objective chosen.

free parameters (1)
  • anchor schedule via Optuna
    The low-budget anchor is found by optimization against a perceptual objective, introducing fitted parameters that affect all interpolated schedules.
axioms (1)
  • domain assumption: Caching schedules across budgets evolve smoothly in policy space under continuous warping representations from optimal transport.
    Invoked to justify the quantile interpolation step between reference and anchor.

pith-pipeline@v0.9.1-grok · 5782 in / 1238 out tokens · 20900 ms · 2026-07-01T06:18:59.312948+00:00 · methodology
