Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness

Hangyeol Lee; Hyojeong Lee; Joo-Young Kim

arxiv: 2605.22011 · v1 · pith:YA7ZP5ROnew · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 07:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords: token reduction · diffusion transformers · output similarity · image generation · computational efficiency · DiTo

The pith

Diffusion transformers can reduce tokens by matching output similarities from prior steps as proxies rather than input similarities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that token reduction for diffusion transformers should center on minimizing output recovery error instead of relying on input token similarity from earlier vision transformer methods. It observes that output token similarities hold steady across nearby generation timesteps, allowing correspondences computed at selected matching steps to be reused over several following reduction steps. An interleaved schedule based on pair match ratio controls how often to recompute these matches, while a frequency penalty reduces repeated approximations that create visible artifacts. When this holds, the method produces higher-fidelity images at the same computational budget than prior reduction techniques.

Core claim

DiTo shifts token reduction to an output-centric view by treating preserved output similarities across adjacent timesteps as a reliable proxy: similarities measured at a matching timestep establish token correspondences that are then applied unchanged across multiple subsequent reduction timesteps, with pair-match-ratio scheduling setting the reuse interval and a selection-frequency penalty correcting for localized errors.
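The matching-then-reuse idea in this claim can be illustrated with a small sketch in plain Python. It is not the authors' implementation. Everything below is an assumption made for illustration: the function names, the fixed set of kept tokens (`kept_idx`), the choice to run all tokens at a Matching timestep, and recovery by copying the matched partner's output.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_tokens(outputs, kept_idx):
    """Map every dropped token to its most similar kept token, using OUTPUT features."""
    kept = set(kept_idx)
    return {
        i: max(kept_idx, key=lambda j: cosine(outputs[i], outputs[j]))
        for i in range(len(outputs))
        if i not in kept
    }


def reduced_step(tokens, kept_idx, mapping, layer):
    """Run `layer` on kept tokens only; dropped tokens copy their partner's output."""
    kept_out = {j: layer(tokens[j]) for j in kept_idx}
    return [kept_out[i] if i in kept_out else kept_out[mapping[i]] for i in range(len(tokens))]


def generate(tokens, layer, num_steps, interval, kept_idx):
    """Interleave Matching steps (every `interval` steps) with Reduction steps."""
    mapping = None
    for t in range(num_steps):
        if t % interval == 0:
            # Matching timestep (assumed here to run all tokens): refresh correspondences.
            tokens = [layer(tok) for tok in tokens]
            mapping = match_tokens(tokens, kept_idx)
        else:
            # Reduction timestep: reuse the stored correspondences unchanged.
            tokens = reduced_step(tokens, kept_idx, mapping, layer)
    return tokens, mapping
```

The point of the sketch is the control flow: correspondences are computed from outputs once, then applied for `interval - 1` further steps without being recomputed. How the paper selects kept tokens and restores dropped ones is not specified in the material summarized here.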

What carries the argument

Preservation of output token similarity across adjacent timesteps, used as a proxy to set token correspondences at matching steps for reuse in reduction steps.

If this is right

Image quality measured by PSNR rises 1.6 to 3.9 dB above existing token-reduction baselines at comparable speedups.
The quality-speed tradeoff curve improves, placing the method on a better Pareto frontier than prior approaches.
Penalizing tokens that are selected too often keeps repeated reuse of the same correspondences from creating blocking artifacts.
The overall schedule balances matching cost against reduction savings through the pair-match-ratio metric.
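One possible reading of the selection-frequency penalty is sketched below in plain Python. The source does not give the penalty's form, so the linear penalty, the weight `lam`, and the function name `frequency_aware_match` are all hypothetical. The sketch assumes the penalty applies to how often a kept token has already been chosen as a partner.

```python
def frequency_aware_match(sim, lam):
    """Pick a partner for each dropped token while discouraging over-used partners.

    sim[i][j] is the similarity of dropped token i to kept token j.
    lam is a hypothetical penalty weight; lam = 0 gives plain best-match selection.
    """
    num_kept = len(sim[0])
    counts = [0] * num_kept  # how often each kept token has been selected so far
    partners = []
    for row in sim:
        # Similarity minus a penalty that grows with prior selections (assumed linear).
        scores = [row[j] - lam * counts[j] for j in range(num_kept)]
        best = max(range(num_kept), key=lambda j: scores[j])
        partners.append(best)
        counts[best] += 1
    return partners, counts
```

With `lam = 0` every dropped token may collapse onto the same partner. A positive `lam` spreads selections across kept tokens, which is the behaviour the penalty is described as encouraging.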

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same output-similarity proxy idea could be tested in other iterative generative processes such as flow-based or autoregressive models.
If output similarity patterns prove consistent in video or 3D generation, the reuse schedule might extend to those domains with only minor changes.
The frequency penalty suggests a general way to control approximation error when any similarity measure is reused across steps.

Load-bearing premise

Output token similarities stay stable enough from one timestep to the next that earlier measurements can stand in for current ones without large errors.

What would settle it

Direct computation of output token similarities at consecutive timesteps shows large changes in which tokens are similar, causing the proxy correspondences to produce visible quality drops.
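A check of this kind could look like the sketch below, written in plain Python. It assumes access to per-token output features at two consecutive timesteps. The helper names and the choice of Pearson correlation over the pairwise cosine-similarity entries are illustrative assumptions, not a procedure taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_entries(outputs):
    """Upper-triangular entries (i < j) of the pairwise cosine-similarity matrix."""
    n = len(outputs)
    return [cosine(outputs[i], outputs[j]) for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def similarity_stability(outputs_t, outputs_next):
    """Correlation between the similarity structure at two consecutive timesteps."""
    return pearson(similarity_entries(outputs_t), similarity_entries(outputs_next))
```

A value near 1 at every adjacent pair of timesteps would support the premise. Values that drop sharply at some timesteps would indicate where reused correspondences are likely to be wrong.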

Figures

Figures reproduced from arXiv: 2605.22011 by Hangyeol Lee, Hyojeong Lee, Joo-Young Kim.

**Figure 2.** Analysis of recovery error and performance trade-offs.

**Figure 3.** Comparison of token reduction methods. (a) Existing input-based methods rely on current input token similarity. (b) The output-based method utilizes prior output token similarity, minimizing recovery error and enhancing generation quality.

**Figure 4.** Quantitative analysis of token alignment and DiTo pipeline.

**Figure 5.** Impact of token selection frequency in TR on image quality.

**Figure 6.** Visualization results of DiTo under varying reduction ratios.

Original abstract

Diffusion Transformers (DiTs) achieve superior image generation quality but suffer from quadratic computational complexity relative to token count. While various token reduction (TR) methods have been proposed to mitigate this cost, they overlook the primary objective of generative models: minimizing recovery error, which requires reflecting output token similarity. They rely solely on input token similarity inherited from reduction-only ViT paradigms, leading to a fundamental misalignment with this objective. To bridge this gap, we propose DiTo, a novel TR paradigm that shifts the focus toward output-centric token reduction. Based on the observation that output token similarity is consistently preserved across adjacent timesteps, DiTo utilizes prior-step similarities as an effective proxy to establish token correspondences at a Matching timestep, which are then reused across multiple subsequent Reduction timesteps. To optimize this interleaved scheduling, we propose Pair Match Ratio (PMR)-guided Interval Scheduling to determine the optimal matching frequency. Furthermore, to mitigate localized approximation errors and resulting blocking artifacts caused by repeated reuse, we propose Frequency-aware Token Matching by incorporating a selection-frequency penalty. Extensive experiments demonstrate that DiTo consistently outperforms existing TR methods with 1.6-3.9 dB higher PSNR at comparable speedups, achieving a superior Pareto frontier.
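The quadratic-cost remark in the abstract can be made concrete with a short back-of-the-envelope calculation. This is a generic property of self-attention, not a figure from the paper. The symbols $N$ (token count), $d$ (feature width), and $\rho$ (fraction of tokens kept) are introduced here for illustration only.

```latex
\[
C_{\text{attn}}(N) \propto N^{2} d
\quad\Longrightarrow\quad
\frac{C_{\text{attn}}(\rho N)}{C_{\text{attn}}(N)} = \rho^{2},
\qquad \text{e.g. } \rho = 0.5 \;\Rightarrow\; \rho^{2} = 0.25 .
\]
```

Only the pairwise attention term shrinks quadratically. Per-token linear layers scale with $\rho$, and computing the matches has its own cost, so the end-to-end speedup is smaller than $1/\rho^{2}$ alone would suggest.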

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DiTo pushes token reduction in diffusion models toward output similarities with proxy reuse across timesteps plus scheduling and frequency penalties, but the gains rest on an unquantified stability assumption and high-level experiment summaries.

The main takeaway is that this work tries to fix a mismatch in token reduction for Diffusion Transformers by focusing on output token similarity instead of input similarity. They note that output similarities hold steady enough from one timestep to the next to reuse them as proxies for matching, then apply those matches over several reduction steps. To make the reuse practical they add PMR-guided Interval Scheduling to pick good matching intervals and a frequency-aware penalty to avoid over-selecting the same tokens and creating artifacts.

Referee Report

2 major / 2 minor

Summary. The paper proposes DiTo, an output-centric token reduction paradigm for Diffusion Transformers. It observes that output token similarities are preserved across adjacent timesteps and reuses prior-step similarity matrices as proxies to establish token correspondences at a Matching timestep, which are then applied over multiple Reduction timesteps. PMR-guided Interval Scheduling determines matching frequency, while Frequency-aware Token Matching adds a selection-frequency penalty to reduce blocking artifacts. The central claim is that DiTo achieves 1.6-3.9 dB higher PSNR than prior TR methods at comparable speedups and a superior Pareto frontier.

Significance. If the performance claims and underlying similarity-preservation assumption hold under rigorous testing, DiTo would represent a meaningful shift from input-similarity-based TR methods toward alignment with the generative objective of minimizing recovery error. This could improve practical efficiency of DiT inference for high-resolution synthesis without proportional quality loss.

major comments (2)

[Abstract / DiTo design paragraph] Abstract and DiTo design description: the core assumption that 'output token similarity is consistently preserved across adjacent timesteps' so that prior-step similarities serve as reliable proxies is stated without any supporting quantification (e.g., per-timestep cosine similarity statistics, correlation coefficients between consecutive output similarity matrices, or ablation on proxy misalignment). This assumption is load-bearing for the interleaved Matching/Reduction schedule and the claimed PSNR gains; without it the reuse mechanism risks accumulating correspondence errors, especially in later timesteps or high-frequency regions.
[Experiments] Experimental results section: the reported 1.6-3.9 dB PSNR improvements and superior Pareto frontier are presented without reference to specific baselines, number of random seeds, statistical significance tests, exact model configurations (e.g., DiT-XL/2 at 512×512), or failure-case analysis. This makes it impossible to verify whether the gains are robust or whether the frequency-aware penalty actually mitigates the artifacts predicted by the proxy-reuse hypothesis.
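The reporting requested in the second major comment could be sketched as follows, in plain Python. PSNR is computed with its standard definition. Treating the full-token output as the reference image and aggregating over seeds with a mean and sample standard deviation are assumptions about how such a report might be built, not a description of the paper's protocol.

```python
import math
import statistics


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def summarize(psnr_values):
    """Mean and sample standard deviation of per-seed PSNR values."""
    return statistics.mean(psnr_values), statistics.stdev(psnr_values)
```

Reporting the spread alongside the mean makes it possible to judge whether a gap of a few dB between two methods exceeds seed-to-seed variation.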

minor comments (2)

[Method] Notation for PMR and the interval-scheduling rule is introduced without an explicit equation or pseudocode; a compact definition would improve reproducibility.
[Figures] Figure captions and axis labels for the Pareto curves should explicitly state the exact speedup metric (e.g., tokens per second or FLOPs reduction) and the reference full-token baseline.
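To show what a compact definition of the kind requested in the first minor comment might look like, here is a purely hypothetical sketch in plain Python. The material summarized here does not define PMR. The definition below (the fraction of token pairs that stay matched between two matchings), the threshold rule, and both function names are guesses made for illustration.

```python
def pair_match_ratio(mapping_ref, mapping_new):
    """Fraction of dropped tokens whose partner is the same in both matchings.

    Hypothetical definition; each mapping is {dropped_token: kept_partner}.
    """
    same = sum(1 for tok, partner in mapping_ref.items() if mapping_new.get(tok) == partner)
    return same / len(mapping_ref)


def choose_interval(mappings, threshold):
    """Pick how many timesteps one matching can cover.

    mappings[k] is the matching that would be computed k steps after the
    Matching timestep. The interval grows while the ratio against mappings[0]
    stays at or above `threshold` (an assumed rule).
    """
    interval = 1
    for k in range(1, len(mappings)):
        if pair_match_ratio(mappings[0], mappings[k]) < threshold:
            break
        interval = k + 1
    return interval
```

Under this reading, an interval of 3 means one Matching timestep followed by two Reduction timesteps before matches are recomputed.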

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of the core assumption and experimental reporting.

Referee: [Abstract / DiTo design paragraph] Abstract and DiTo design description: the core assumption that 'output token similarity is consistently preserved across adjacent timesteps' so that prior-step similarities serve as reliable proxies is stated without any supporting quantification (e.g., per-timestep cosine similarity statistics, correlation coefficients between consecutive output similarity matrices, or ablation on proxy misalignment). This assumption is load-bearing for the interleaved Matching/Reduction schedule and the claimed PSNR gains; without it the reuse mechanism risks accumulating correspondence errors, especially in later timesteps or high-frequency regions.

Authors: We agree that explicit quantification would strengthen the manuscript. While the similarity-preservation observation underpins the PMR-guided scheduling and reuse mechanism, the initial submission did not include per-timestep statistics. In the revision we will add cosine-similarity statistics across adjacent timesteps, correlation coefficients between consecutive output similarity matrices, and an ablation on proxy misalignment to show that correspondence errors remain limited and do not materially degrade the reported PSNR gains.

Revision: yes

Referee: [Experiments] Experimental results section: the reported 1.6-3.9 dB PSNR improvements and superior Pareto frontier are presented without reference to specific baselines, number of random seeds, statistical significance tests, exact model configurations (e.g., DiT-XL/2 at 512×512), or failure-case analysis. This makes it impossible to verify whether the gains are robust or whether the frequency-aware penalty actually mitigates the artifacts predicted by the proxy-reuse hypothesis.

Authors: We acknowledge that additional experimental details are required for full reproducibility and verification. In the revised manuscript we will explicitly name the baseline TR methods, report results averaged over multiple random seeds, include statistical significance tests where applicable, specify the exact DiT configurations and resolutions (including DiT-XL/2 at 512×512), and provide a failure-case analysis demonstrating how the frequency-aware penalty reduces blocking artifacts arising from repeated proxy reuse.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: design choices are independent heuristics validated empirically.

full rationale

The paper states an observation about output token similarity preservation across timesteps and uses it to motivate two new algorithmic components (PMR-guided Interval Scheduling and Frequency-aware Token Matching with penalty). These are presented as design decisions rather than derived equations. No load-bearing step reduces by construction to a fitted parameter, self-referential definition, or unverified self-citation chain. The central claim (PSNR gains on a Pareto frontier) rests on experimental comparisons, not on any internal equivalence that would make the result tautological. This is the common case of a heuristic method with external empirical grounding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

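The central claim rests primarily on the domain assumption that output similarities remain stable across timesteps. The abstract introduces no invented entities and states no explicit free-parameter values; the single free parameter counted here, the PMR interval setting, is implied by the scheduling scheme rather than specified.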

free parameters (1)

PMR interval parameters
Used to set matching frequency; exact values or fitting procedure not described in abstract.

axioms (1)

domain assumption: Output token similarity is consistently preserved across adjacent timesteps
Invoked to justify reuse of prior-step matches as proxy for current timestep correspondences.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 2 internal anchors

[1] Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your ViT but faster. In: International Conference on Learning Representations (2023)

[2] Bolya, D., Hoffman, J.: Token merging for fast stable diffusion. CVPR Workshop on Efficient Deep Learning for Computer Vision (2023)

[3] Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation (2024)

[4] Chen, Y., Ma, Z., Yang, C., An, Z., Zhang, Y.: Accelerating diffusion models via parallel denoising. In: Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM). pp. 10652–10661 (2025)

[5] Dao, T.: FlashAttention-2: Faster attention with better parallelism and work partitioning. In: International Conference on Learning Representations (ICLR) (2024)

[6] Esser, P., Kulal, S., Andreas, B., Enright, A., Sheynin, J., Sauer, A., Chen, D., Podell, D., Evans, D., Brack, M., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Proceedings of the 41st International Conference on Machine Learning (ICML) (2024)

[7] Fayyaz, M., Koohpayegani, S.A., Jafari, F.R., Sengupta, S., Joze, H.R.V., Sommerlade, E., Pirsiavash, H., Gall, J.: Adaptive token sampling for efficient vision transformers. In: European Conference on Computer Vision (ECCV) (2022)

[8] Guo, Bowei, et al.: Mosaicdiff: Training-free structural pruning for diffusion model acceleration reflecting pretraining dynamics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1655–1664 (2025), https://openaccess.thecvf.com/content/ICCV2025/html/Guo_MosaicDiff_Training-free_Structural_Pruning_for_Diffusi...

[9] He, Yefei, et al.: Ptqd: Accurate post-training quantization for diffusion models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 36, pp. 13237–13249 (2023), https://proceedings.neurips.cc/paper_files/paper/2023/hash/2aab8a76c7e761b66eccaca0927787de-Abstract-Conference.html

[10] Hessel, Jack, et al.: Clipscore: A reference-free evaluation metric for image captioning. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 7514–7528 (2021)

[11] Heusel, Martin, et al.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)

[12] Huynh-Thu, Q., Ghanbari, M.: A study of the psnr metric for image quality assessment. EURASIP Journal on Image and Video Processing pp. 1–7 (2008)

[13] Kim, Bo-Kyeong, et al.: Bk-sdm: A lightweight, fast, and cheap version of stable diffusion. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 381–399. Springer (2024), https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/7138_ECCV_2024_paper.php

[14] Kim, M., Gao, S., Hsu, Y.C., Shen, Y., Jin, H.: Token fusion: Bridging the gap between token pruning and token merging. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1372–1381 (2024)

[15] Kong, Z., Dong, P., Ma, X., Meng, X., Niu, W., Sun, M., Shen, X., Yuan, G., Ren, B., Tang, H., et al.: Spvit: Enabling faster vision transformers via latency-aware soft token pruning. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI. pp. 620–640. Springer (2022)

[16] Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2024)

[17] Lee, Y., Park, K., Cho, Y., Lee, Y.J., Hwang, S.J.: Koala: Empirical lessons toward memory-efficient and fast diffusion models for text-to-image synthesis. In: Advances in Neural Information Processing Systems. vol. 37, pp. 51597–51633 (2024)

[18] Lefaudeux, B., Massa, F., Liskovich, D., Xiong, W., Caggiano, V., Naren, S., Xu, M., Hu, J., Tintore, M., Zhang, S., Labatut, P., Haziza, D., Wehrstedt, L., Reizenstein, J., Sizov, G.: xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers (2022)

[19] Li, Xiuyu, et al.: Q-diffusion: Quantizing diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 17535–17545 (2023), https://openaccess.thecvf.com/content/ICCV2023/html/Li_Q-Diffusion_Quantizing_Diffusion_Models_ICCV_2023_paper.html

[20] Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., Xie, P.: Evit: Expediting vision transformers via token reorganizations. In: International Conference on Learning Representations (ICLR) (2022)

[21] Liu, Xuewen, et al.: Eda-dm: Enhanced distribution alignment for post-training quantization of diffusion models. arXiv preprint arXiv:2401.04585 (2024), https://arxiv.org/abs/2401.04585

[22] Lu, Cheng, et al.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 5775–5787 (2022), https://proceedings.neurips.cc/paper_files/paper/2022/hash/260a14acce2a89dad36adc8eefe7c59e-Abstract-Conference.html

[23] Lu, Cheng, et al.: Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research 22(4), 730–751 (2025). https://doi.org/10.1007/s11633-025-1562-4, https://www.mi-research.net/en/article/doi/10.1007/s11633-025-1562-4

[24] Lu, W., Zheng, S., Xia, Y., Wang, S.: ToMA: Token merge with attention for diffusion models. In: Forty-second International Conference on Machine Learning (2025)

[25] Luo, Simian, et al.: Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023), https://arxiv.org/abs/2310.04378

[26] Marin, D., Chang, J.R., Ranjan, A., Prabhu, A., Rastegari, M., Tuzel, O.: Token pooling in vision transformers for image classification. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 12–21 (2023)

[27] Meng, C., Rombach, R., Gao, R., Kingma, D.P., Ermon, S., Ho, J., Salimans, T.: On distillation of guided diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14297–14306 (2023), https://openaccess.thecvf.com/content/CVPR2023/html/Meng_On_Distillation_of_Guided_Diffusion_Models_CVPR_2023_paper.html

[28] Meng, X., Li, X., Wang, Y., Wu, X., Zhang, Y., Sun, J.: Adavit: Adaptive vision transformers for efficient image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

[29] NVIDIA Corporation: Nvidia rtx 6000 ada generation. https://www.nvidia.com/en-us/products/workstations/rtx-6000/ (2023), accessed: 2026-03-05

[30] Pan, X., Ge, C., Lu, R., Song, S., Huang, G., Wang, Z., Huang, Z.: Ia-red 2: Interpretability-aware redundancy reduction for vision transformers. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)

[31] Peebles, W., Xie, S.: Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748 (2022)

[32] von Platen, P., Patil, S., Lozhkov, A., Cuenca, P., Lambert, N., Rasul, K., Davaadorj, M., Nair, D., Paul, S., Berman, W., Xu, Y., Liu, S., Wolf, T.: Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers (2022)

[33] Proust, M., Martyna Poreba, Michal Szczepanski, K.H.: Step: Supertoken and early-pruning for efficient semantic segmentation. In: Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP. pp. 50–61 (2025)

[34] Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: Efficient vision transformers with dynamic token sparsification. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)

[35] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y

[36] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: International Conference on Learning Representations (ICLR) (2022)

[37] Smith, E., Saxena, N., Saha, A.: Todo: Token downsampling for efficient generation of high-resolution images (2024), https://arxiv.org/abs/2402.13573

[38] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (ICLR) (2021)

[39] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

[40] Yang, Y., Yue Zhou, Xiaofang Hu, S.D.: K-feature fusion token merging for vision transformer. Expert Systems with Applications 288, 128206 (2025)

[41] Yin, Tianwei, et al.: Improved distribution matching distillation for fast image synthesis. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 47455–47487 (2024). https://doi.org/10.52202/079017-1505, https://proceedings.neurips.cc/paper_files/paper/2024/hash/54dcf25318f9de5a7a01f0a4125c541e-Abstrac...

[42] Yin, H., Vahdat, A., Alvarez, J.M., Mallya, A., Kautz, J., Molchanov, P.: A-ViT: Adaptive tokens for efficient vision transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10809–10818 (June 2022)

[43] Zhan, Z., Wu, Y., Gong, Y., Meng, Z., Kong, Z., Yang, C., Wang, Y.: Fast and memory-efficient video diffusion using streamlined inference. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 13660–13684 (2024)

[44] Zhang, Richard, et al.: The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 586–595 (2018)

[45] Zhang, E., Tang, J., Ning, X., Zhang, L.: Training-free and hardware-friendly acceleration for diffusion models via similarity-based token pruning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 9878–9886 (2025)

[46] Zhao, W., Bai, L., Rao, Y., Zhou, J., Lu, J.: Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 36, pp. 49842–49869 (2023), https://proceedings.neurips.cc/paper_files/paper/2023/hash/9c2aa1e456ea543997f6927295196381-Abstract-Confer...

[47] Zheng, X., Liu, X., Bian, Y., Ma, X., Zhang, Y., Wang, J., Qin, H.: Bidm: Pushing the limit of quantization for diffusion models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 39009–39035 (2024)

[48] Zhou, Zhenyu, et al.: Fast ode-based sampling for diffusion models in around 5 steps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7777–7786 (2024), https://openaccess.thecvf.com/content/CVPR2024/html/Zhou_Fast_ODE-based_Sampling_for_Diffusion_Models_in_Around_5_Steps_CVPR_2024_paper.html

[49] Zhu, J., Wang, H., Su, M., Wang, Z., Wang, H.: Obs-diff: Accurate pruning for diffusion models in one-shot. arXiv preprint arXiv:2510.06751 (2025)

[1] [1]

In: International Conference on Learning Represen- tations (2023)

Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your ViT but faster. In: International Conference on Learning Represen- tations (2023)

work page 2023

[2] [2]

CVPR Workshop on Efficient Deep Learning for Computer Vision (2023)

Bolya, D., Hoffman, J.: Token merging for fast stable diffusion. CVPR Workshop on Efficient Deep Learning for Computer Vision (2023)

work page 2023

[3] [3]

Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation (2024)

work page 2024

[4] [4]

In: Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM)

Chen, Y., Ma, Z., Yang, C., An, Z., Zhang, Y.: Accelerating diffusion models via parallel denoising. In: Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM). pp. 10652–10661 (2025)

work page 2025

[5] [5]

In: International Conference on Learning Representations (ICLR) (2024)

Dao, T.: FlashAttention-2: Faster attention with better parallelism and work par- titioning. In: International Conference on Learning Representations (ICLR) (2024)

work page 2024

[6] [6]

In: Proceedings of the 41st International Conference on Machine Learning (ICML) (2024)

Esser, P., Kulal, S., Andreas, B., Enright, A., Sheynin, J., Sauer, A., Chen, D., Podell, D., Evans, D., Brack, M., et al.: Scaling rectified flow transformers for high- resolution image synthesis. In: Proceedings of the 41st International Conference on Machine Learning (ICML) (2024)

work page 2024

[7] [7]

In: European Conference on Computer Vision (ECCV) (2022)

Fayyaz, M., Koohpayegani, S.A., Jafari, F.R., Sengupta, S., Joze, H.R.V., Som- merlade, E., Pirsiavash, H., Gall, J.: Adaptive token sampling for efficient vision transformers. In: European Conference on Computer Vision (ECCV) (2022)

work page 2022

[8]

Guo, B., et al.: Mosaicdiff: Training-free structural pruning for diffusion model acceleration reflecting pretraining dynamics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1655–1664 (2025), https://openaccess.thecvf.com/content/ICCV2025/html/Guo_MosaicDiff_Training-free_Structural_Pruning_for_Diffusi...

[9]

He, Y., et al.: Ptqd: Accurate post-training quantization for diffusion models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 36, pp. 13237–13249 (2023), https://proceedings.neurips.cc/paper_files/paper/2023/hash/2aab8a76c7e761b66eccaca0927787de-Abstract-Conference.html

[10]

Hessel, J., et al.: Clipscore: A reference-free evaluation metric for image captioning. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 7514–7528 (2021)

[11]

Heusel, M., et al.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)

[12]

Huynh-Thu, Q., Ghanbari, M.: A study of the PSNR metric for image quality assessment. EURASIP Journal on Image and Video Processing pp. 1–7 (2008)

[13]

Kim, B.K., et al.: Bk-sdm: A lightweight, fast, and cheap version of stable diffusion. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 381–399. Springer (2024), https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/7138_ECCV_2024_paper.php

[14]

Kim, M., Gao, S., Hsu, Y.C., Shen, Y., Jin, H.: Token fusion: Bridging the gap between token pruning and token merging. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1372–1381 (2024)

[15]

Kong, Z., Dong, P., Ma, X., Meng, X., Niu, W., Sun, M., Shen, X., Yuan, G., Ren, B., Tang, H., et al.: Spvit: Enabling faster vision transformers via latency-aware soft token pruning. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI. pp. 620–640. Springer (2022)

[16]

Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2024)

[17]

Lee, Y., Park, K., Cho, Y., Lee, Y.J., Hwang, S.J.: Koala: Empirical lessons toward memory-efficient and fast diffusion models for text-to-image synthesis. In: Advances in Neural Information Processing Systems. vol. 37, pp. 51597–51633 (2024)

[18]

Lefaudeux, B., Massa, F., Liskovich, D., Xiong, W., Caggiano, V., Naren, S., Xu, M., Hu, J., Tintore, M., Zhang, S., Labatut, P., Haziza, D., Wehrstedt, L., Reizenstein, J., Sizov, G.: xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers (2022)

[19]

Li, X., et al.: Q-diffusion: Quantizing diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 17535–17545 (2023), https://openaccess.thecvf.com/content/ICCV2023/html/Li_Q-Diffusion_Quantizing_Diffusion_Models_ICCV_2023_paper.html

[20]

Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., Xie, P.: Evit: Expediting vision transformers via token reorganizations. In: International Conference on Learning Representations (ICLR) (2022)

[21]

Liu, X., et al.: Eda-dm: Enhanced distribution alignment for post-training quantization of diffusion models. arXiv preprint arXiv:2401.04585 (2024), https://arxiv.org/abs/2401.04585

[22]

Lu, C., et al.: Dpm-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 5775–5787 (2022), https://proceedings.neurips.cc/paper_files/paper/2022/hash/260a14acce2a89dad36adc8eefe7c59e-Abstract-Conference.html

[23]

Lu, C., et al.: Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research 22(4), 730–751 (2025). https://doi.org/10.1007/s11633-025-1562-4, https://www.mi-research.net/en/article/doi/10.1007/s11633-025-1562-4

[24]

Lu, W., Zheng, S., Xia, Y., Wang, S.: ToMA: Token merge with attention for diffusion models. In: Forty-second International Conference on Machine Learning (2025)

[25]

Luo, S., et al.: Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023), https://arxiv.org/abs/2310.04378

[26]

Marin, D., Chang, J.R., Ranjan, A., Prabhu, A., Rastegari, M., Tuzel, O.: Token pooling in vision transformers for image classification. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 12–21 (2023)

[27]

Meng, C., Rombach, R., Gao, R., Kingma, D.P., Ermon, S., Ho, J., Salimans, T.: On distillation of guided diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14297–14306 (2023), https://openaccess.thecvf.com/content/CVPR2023/html/Meng_On_Distillation_of_Guided_Diffusion_Models_CVPR_2023_paper.html

[28]

Meng, X., Li, X., Wang, Y., Wu, X., Zhang, Y., Sun, J.: Adavit: Adaptive vision transformers for efficient image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

[29]

NVIDIA Corporation: Nvidia rtx 6000 ada generation. https://www.nvidia.com/en-us/products/workstations/rtx-6000/ (2023), accessed: 2026-03-05

[30]

Pan, X., Ge, C., Lu, R., Song, S., Huang, G., Wang, Z., Huang, Z.: Ia-red 2: Interpretability-aware redundancy reduction for vision transformers. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)

[31]

Peebles, W., Xie, S.: Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748 (2022)

[32]

von Platen, P., Patil, S., Lozhkov, A., Cuenca, P., Lambert, N., Rasul, K., Davaadorj, M., Nair, D., Paul, S., Berman, W., Xu, Y., Liu, S., Wolf, T.: Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers (2022)

[33]

Proust, M., Martyna Poreba, Michal Szczepanski, K.H.: Step: Supertoken and early-pruning for efficient semantic segmentation. In: Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP. pp. 50–61 (2025)

[34]

Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: Efficient vision transformers with dynamic token sparsification. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)

[35]

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y

[36]

Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: International Conference on Learning Representations (ICLR) (2022)

[37]

Smith, E., Saxena, N., Saha, A.: Todo: Token downsampling for efficient generation of high-resolution images (2024), https://arxiv.org/abs/2402.13573

[38]

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (ICLR) (2021)

[39]

Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

[40]

Yang, Y., Yue Zhou, Xiaofang Hu, S.D.: K-feature fusion token merging for vision transformer. Expert Systems with Applications 288, 128206 (2025)

[41]

Yin, T., et al.: Improved distribution matching distillation for fast image synthesis. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 47455–47487 (2024). https://doi.org/10.52202/079017-1505, https://proceedings.neurips.cc/paper_files/paper/2024/hash/54dcf25318f9de5a7a01f0a4125c541e-Abstrac...

[42]

Yin, H., Vahdat, A., Alvarez, J.M., Mallya, A., Kautz, J., Molchanov, P.: A-ViT: Adaptive tokens for efficient vision transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10809–10818 (June 2022)

[43]

Zhan, Z., Wu, Y., Gong, Y., Meng, Z., Kong, Z., Yang, C., Wang, Y.: Fast and memory-efficient video diffusion using streamlined inference. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 13660–13684 (2024)

[44]

Zhang, R., et al.: The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 586–595 (2018)

[45]

Zhang, E., Tang, J., Ning, X., Zhang, L.: Training-free and hardware-friendly acceleration for diffusion models via similarity-based token pruning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 9878–9886 (2025)

[46]

Zhao, W., Bai, L., Rao, Y., Zhou, J., Lu, J.: Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 36, pp. 49842–49869 (2023), https://proceedings.neurips.cc/paper_files/paper/2023/hash/9c2aa1e456ea543997f6927295196381-Abstract-Confer...

[47]

Zheng, X., Liu, X., Bian, Y., Ma, X., Zhang, Y., Wang, J., Qin, H.: Bidm: Pushing the limit of quantization for diffusion models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 39009–39035 (2024)

[48]

Zhou, Z., et al.: Fast ODE-based sampling for diffusion models in around 5 steps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7777–7786 (2024), https://openaccess.thecvf.com/content/CVPR2024/html/Zhou_Fast_ODE-based_Sampling_for_Diffusion_Models_in_Around_5_Steps_CVPR_2024_paper.html

[49]

Zhu, J., Wang, H., Su, M., Wang, Z., Wang, H.: Obs-diff: Accurate pruning for diffusion models in one-shot. arXiv preprint arXiv:2510.06751 (2025)