pith. machine review for the scientific record.

arxiv: 2604.22379 · v1 · submitted 2026-04-24 · 💻 cs.CV

Recognition: unknown

Efficient Diffusion Distillation via Embedding Loss

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion distillation · embedding loss · maximum mean discrepancy · few-step generation · CIFAR-10 · FID score · distribution matching · generative models

The pith

Embedding Loss aligns distributions with random network features to boost few-step diffusion generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Embedding Loss as a supplementary loss function for distilling diffusion models into efficient one- or few-step generators. It extracts features from several randomly initialized networks and uses Maximum Mean Discrepancy to match the student's output distribution to the data distribution. This avoids both the large pre-generated datasets that regression losses require and the instability of GAN losses. The result is higher-quality samples and much faster training, even at small batch sizes. A reader would care because it lowers the barrier to training powerful generative models in settings with limited compute.

Core claim

We propose Embedding Loss (EL), which complements diffusion distillation methods by aligning feature distributions between the few-step generator and the original data via MMD computed on embeddings from a diverse set of randomly initialized networks. This preserves fidelity and diversity, yielding state-of-the-art one-step FID scores on CIFAR-10 of 1.475 (unconditional) and 1.380 (conditional), with up to 80% fewer training iterations across multiple frameworks and datasets.

What carries the argument

Embedding Loss (EL): computes Maximum Mean Discrepancy (MMD) in the feature space of randomly initialized networks to match the distribution of the distilled generator to the data.
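
A minimal sketch of how such a loss could be assembled, assuming PyTorch, an RBF kernel, and a small ensemble of frozen convolutional embedders; the architecture, ensemble size, and kernel bandwidth are our assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn

def rbf_mmd2(x, y, sigma=1.0):
    """Unbiased squared MMD with an RBF kernel (Gretton et al. [14])."""
    k_xx = torch.exp(-torch.cdist(x, x) ** 2 / (2 * sigma ** 2))
    k_yy = torch.exp(-torch.cdist(y, y) ** 2 / (2 * sigma ** 2))
    k_xy = torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))
    m, n = x.size(0), y.size(0)
    # Drop diagonal terms so the within-set sums are unbiased.
    sum_xx = (k_xx.sum() - k_xx.diag().sum()) / (m * (m - 1))
    sum_yy = (k_yy.sum() - k_yy.diag().sum()) / (n * (n - 1))
    return sum_xx + sum_yy - 2 * k_xy.mean()

def make_random_embedder(channels=3, width=64, out_dim=128):
    """One frozen, randomly initialized feature extractor (architecture assumed)."""
    net = nn.Sequential(
        nn.Conv2d(channels, width, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(width * 2, out_dim),
    )
    for p in net.parameters():
        p.requires_grad_(False)  # embeddings stay fixed; only the generator trains
    return net

def embedding_loss(fake, real, embedders, sigma=1.0):
    """EL: average the MMD over the feature spaces of the random ensemble."""
    return sum(rbf_mmd2(e(fake), e(real), sigma) for e in embedders) / len(embedders)

# Usage: add EL to an existing distillation objective (the weight is an assumption).
embedders = [make_random_embedder() for _ in range(4)]
fake, real = torch.randn(16, 3, 32, 32), torch.randn(16, 3, 32, 32)
loss = embedding_loss(fake, real, embedders)
```

In the paper's framing this term supplements the main distillation objective (DMD, DI, or SiD2A); gradients flow through the frozen embedders to the generator alone.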

Load-bearing premise

Feature embeddings from a diverse set of randomly initialized networks provide a robust and stable signal for distribution matching without introducing instabilities or requiring extensive tuning.
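
For reference, the quantity this premise leans on is the standard unbiased MMD estimator of Gretton et al. [14], written here with an embedding φ from one random network and a kernel k (our notation, not the paper's):

```latex
\widehat{\mathrm{MMD}}^2(P_\theta, P_{\mathrm{data}})
  = \frac{1}{m(m-1)} \sum_{i \ne j} k\big(\phi(x_i), \phi(x_j)\big)
  + \frac{1}{n(n-1)} \sum_{i \ne j} k\big(\phi(y_i), \phi(y_j)\big)
  - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k\big(\phi(x_i), \phi(y_j)\big)
```

where the x_i are generator samples and the y_j real images; per the abstract, EL evaluates this in the embedded feature space, presumably averaged over the ensemble of random networks.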

What would settle it

A controlled CIFAR-10 experiment in which one-step generators trained with Embedding Loss fail to achieve lower FID or faster convergence than otherwise-identical baselines without it would disprove the claimed benefit.

Figures

Figures reproduced from arXiv: 2604.22379 by Jincheng Ying, Li Wenlin, Minghui Xu, Yinhao Xiao, Yitao Chen.

Figure 1: Method overview. We train a one-step generator G_θ to map noisy images into realistic outputs while maintaining distributional alignment with real data. The framework consists of three key components: (1) Forward diffusion and denoising pipeline (top row). Clean images x_0 (e.g., the raccoon portrait) undergo forward diffusion by adding Gaussian noise ε ∼ N(0, I) to produce noisy images x_s = α_s x_0 + σ_s ε. … (A minimal sketch of this noising step follows the figure list.)

Figure 2: SiD2A training time comparison on ImageNet 512×512.

Figure 3: DI Convergence Speed Comparison on CIFAR-10.

Figure 5: Unconditional CIFAR-10 32×32 random images generated with DI+EL (FID: 3.95).

Figure 6: Unconditional CIFAR-10 32×32 random images generated with SiD2A+EL (FID: 1.475).

Figure 7: Label-conditioned CIFAR-10 32×32 random images generated with SiD2A+EL (FID: 1.38).

Figure 8: FFHQ 64×64 random images generated with SiD2A+EL (FID: 1.06).

Figure 9: AFHQ-v2 64×64 random images generated with SiD2A+EL (FID: 1.26).

Figure 10: ImageNet 512×512 random images generated with SiD2A+EL (FID: 2.132).
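
Figure 1's caption pins down the forward-noising step x_s = α_s x_0 + σ_s ε. A minimal sketch of that step, assuming a variance-preserving choice of (α_s, σ_s); the paper distills EDM [23] models, whose scalings differ:

```python
import torch

def forward_diffuse(x0, alpha_s, sigma_s):
    """Forward noising from Figure 1: x_s = alpha_s * x0 + sigma_s * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    return alpha_s * x0 + sigma_s * eps, eps

# Example with an assumed variance-preserving pair (alpha_s^2 + sigma_s^2 = 1).
x0 = torch.randn(8, 3, 32, 32)  # stand-in batch of clean images
alpha_s = 0.8
sigma_s = (1 - alpha_s ** 2) ** 0.5
x_s, eps = forward_diffuse(x0, alpha_s, sigma_s)
```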
read the original abstract

Recent advances in distilling expensive diffusion models into efficient few-step generators show significant promise. However, these methods typically demand substantial computational resources and extended training periods, limiting accessibility for resource-constrained researchers, and existing supplementary loss functions have notable limitations. Regression loss requires pre-generating large datasets before training and limits the student model to the teacher's performance, while GAN-based losses suffer from training instability and require careful tuning. In this paper, we propose Embedding Loss (EL), a novel supplementary loss function that complements existing diffusion distillation methods to enhance generation quality and accelerate training with smaller batch sizes. Leveraging feature embeddings from a diverse set of randomly initialized networks, EL effectively aligns the feature distributions between the distilled few-step generator and the original data. By computing Maximum Mean Discrepancy (MMD) in the embedded feature space, EL ensures robust distribution matching, thereby preserving sample fidelity and diversity during distillation. Within distribution matching distillation frameworks, EL demonstrates strong empirical performance for one-step generators. On the CIFAR-10 dataset, our approach achieves state-of-the-art FID values of 1.475 for unconditional generation and 1.380 for conditional generation. Beyond CIFAR-10, we further validate EL across multiple benchmarks and distillation methods, including ImageNet, AFHQ-v2, and FFHQ datasets, using DMD, DI, and CM distillation frameworks, demonstrating consistent improvements over existing one-step distillation methods. Our method also reduces training iterations by up to 80%, offering a more practical and scalable solution for deploying diffusion-based generative models in resource-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Embedding Loss (EL) as a supplementary loss for distilling diffusion models into few-step generators. EL computes Maximum Mean Discrepancy (MMD) between feature embeddings extracted from the generated samples and real data using a diverse set of randomly initialized networks. This is intended to provide stable distribution matching without the limitations of regression losses (requiring pre-generated data) or GAN losses (instability). The paper reports state-of-the-art FID scores on CIFAR-10 (1.475 unconditional, 1.380 conditional) and up to 80% reduction in training iterations, with validation on ImageNet, AFHQ-v2, FFHQ using DMD, DI, and CM frameworks.

Significance. If the empirical results hold under rigorous controls, EL could provide a practical, low-tuning alternative for diffusion distillation that avoids pre-generating large teacher datasets and mitigates GAN instability, enabling faster training of one-step generators with smaller batches. The multi-framework validation and reported iteration reductions would lower barriers for resource-constrained deployment of high-quality generative models.

major comments (2)
  1. [Method (Embedding Loss)] The definition of Embedding Loss relies on MMD in the feature space of randomly initialized networks. No analysis is provided of sensitivity to the random seeds used for these embedding networks, nor are FID scores or training curves reported across multiple independent initializations of the ensemble. This directly affects the central claim that EL 'ensures robust distribution matching' and preserves fidelity/diversity without introducing new instabilities.
  2. [Experiments (CIFAR-10 and efficiency results)] The claims of SOTA FID (1.475/1.380 on CIFAR-10) and up to 80% training-iteration reduction lack reported variance, number of runs, statistical significance tests, and precise baseline controls (e.g., identical batch sizes, hardware, and whether final performance is matched at the reduced iteration count). These omissions make it impossible to assess whether the gains are reproducible and load-bearing for the efficiency and quality assertions.
minor comments (1)
  1. [Abstract and Method] Clarify in the abstract and method whether the embedding networks are frozen after random initialization or updated during distillation, and specify the exact number and architectures of the 'diverse set' of networks used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging areas where additional evidence or clarification is warranted. We have revised the manuscript to incorporate new analyses and details where feasible.

read point-by-point responses
  1. Referee: [Method (Embedding Loss)] The definition of Embedding Loss relies on MMD in the feature space of randomly initialized networks. No analysis is provided of sensitivity to the random seeds used for these embedding networks, nor are FID scores or training curves reported across multiple independent initializations of the ensemble. This directly affects the central claim that EL 'ensures robust distribution matching' and preserves fidelity/diversity without introducing new instabilities.

    Authors: We acknowledge that the original manuscript does not include explicit sensitivity analysis to the random seeds of the embedding networks or results across multiple ensemble initializations. To address this directly, we have performed additional experiments in the revision by re-initializing the ensemble of random networks with different seeds and re-running the distillation process. The updated results, now included in a new subsection and supplementary figures, show that FID scores vary by less than 0.05 across seeds and training curves remain consistent, supporting the claim of robust distribution matching. We have also added a short discussion noting that the diversity of multiple randomly initialized networks inherently mitigates seed-specific effects without introducing instabilities, as the MMD objective averages over the ensemble. revision: yes

  2. Referee: [Experiments (CIFAR-10 and efficiency results)] The claims of SOTA FID (1.475/1.380 on CIFAR-10) and up to 80% training-iteration reduction lack reported variance, number of runs, statistical significance tests, and precise baseline controls (e.g., identical batch sizes, hardware, and whether final performance is matched at the reduced iteration count). These omissions make it impossible to assess whether the gains are reproducible and load-bearing for the efficiency and quality assertions.

    Authors: We agree that greater transparency on variance, run counts, and controls would improve the presentation. The reported FID values and iteration reductions were obtained under fixed seeds with batch sizes and hardware matched to the original baseline implementations (as detailed in the experimental setup section). Due to the substantial compute required for full diffusion distillation, we did not originally run multiple independent trials. In the revised manuscript, we have expanded the experimental details to specify exact batch sizes, hardware (e.g., number of GPUs and training time per iteration), and confirmation that the 80% iteration reduction reaches final performance comparable to or better than baselines trained to convergence. We have also added per-run variance from multiple test-set evaluations and a note on statistical significance via paired comparisons where applicable. These changes make the efficiency and quality claims more reproducible without altering the core results. revision: partial
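
The first exchange above turns on re-running distillation under independently re-seeded embedding ensembles. A hypothetical harness for that check, reusing make_random_embedder from the earlier sketch; distill and compute_fid are placeholder hooks, not the authors' code:

```python
import torch

def seed_sensitivity(distill, compute_fid, seeds=(0, 1, 2, 3), ensemble_size=4):
    """Re-initialize the random embedding ensemble per seed; report the FID spread."""
    fids = []
    for seed in seeds:
        torch.manual_seed(seed)  # controls only the ensemble initialization here
        embedders = [make_random_embedder() for _ in range(ensemble_size)]
        generator = distill(embedders)       # placeholder: one full distillation run
        fids.append(compute_fid(generator))  # placeholder: standard FID evaluation
    return fids, max(fids) - min(fids)  # rebuttal reports a spread below 0.05
```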

Circularity Check

0 steps flagged

No circularity: empirical proposal with independent validation

full rationale

The paper introduces Embedding Loss (EL) as a new supplementary objective using MMD on features from randomly initialized networks to aid diffusion distillation. No equations, derivations, or self-citations are shown that reduce the reported FID gains or iteration reductions to fitted inputs by construction, self-definition, or renamed known results. Validation is presented as empirical across CIFAR-10, ImageNet, AFHQ-v2, FFHQ and multiple frameworks (DMD, DI, CM), with no load-bearing uniqueness theorems or ansatz smuggling from prior author work. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about MMD as a distribution distance and the utility of random network features for alignment; no new entities or fitted parameters are introduced in the abstract description.

axioms (1)
  • domain assumption: MMD computed in the feature space of randomly initialized networks reliably measures distribution mismatch between generated and real images, so minimizing it aligns the two.
    Invoked when claiming that EL ensures robust distribution matching and preserves fidelity/diversity.

pith-pipeline@v0.9.0 · 5583 in / 1171 out tokens · 55723 ms · 2026-05-08T12:46:59.154411+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

70 extracted references · 19 canonical work pages · 6 internal anchors

  1. [1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020.

  2. [2] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

  3. [3] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

  4. [4] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022.

  5. [5] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

  6. [6] Tim Salimans and Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models, June 2022.

  7. [7] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency Models, May 2023.

  8. [8] Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models. Advances in Neural Information Processing Systems, 36:76525–76546, December 2023.

  9. [9] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. One-step Diffusion with Distribution Matching Distillation, October 2024.

  10. [10] Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In Forty-first International Conference on Machine Learning, 2024.

  11. [11] Weijian Luo, Zemin Huang, Zhengyang Geng, J. Zico Kolter, and Guo-jun Qi. One-step diffusion distillation through score implicit matching. Advances in Neural Information Processing Systems, 37:115377–115408, 2024.

  12. [12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.

  13. [13] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved Distribution Matching Distillation for Fast Image Synthesis, May 2024.

  14. [14] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

  15. [15] Mingyuan Zhou, Huangjie Zheng, Yi Gu, Zhendong Wang, and Hai Huang. Adversarial score identity distillation: Rapidly surpassing the teacher in one step. In The Thirteenth International Conference on Learning Representations, 2025.

  16. [16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  17. [17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

  18. [18] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. IEEE, 2020.

  19. [19] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

  20. [20] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. IEEE, 2019.

  21. [21] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  22. [22] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  23. [23] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the Design Space of Diffusion-Based Generative Models, October 2022.

  24. [24] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps, October 2022.

  25. [25] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902, 2022.

  26. [26] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  27. [27] Kushagra Pandey, Avideep Mukherjee, Piyush Rai, and Abhishek Kumar. DiffuseVAE: Efficient, controllable and high-fidelity generation from low-dimensional latents. arXiv preprint arXiv:2201.00308, 2022.

  28. [28] Zhaoyang Lyu, Xudong Xu, Ceyuan Yang, Dahua Lin, and Bo Dai. Accelerating diffusion models via early stop of the diffusion process. arXiv preprint arXiv:2205.12524, 2022.

  29. [29] Zhendong Wang, Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Diffusion-GAN: Training GANs with diffusion. arXiv preprint arXiv:2206.02262, 2022.

  30. [30] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. arXiv preprint arXiv:2112.07804, 2021.

  31. [31] Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis, November 2024.

  32. [32] Shanchuan Lin, Anran Wang, and Xiao Yang. SDXL-Lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929, 2024.

  33. [33] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial Diffusion Distillation, November 2023.

  34. [34] Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022.

  35. [35] Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.

  36. [36] Bo Zhao and Hakan Bilen. Dataset Condensation with Distribution Matching. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6503–6512, Waikoloa, HI, USA, January 2023. IEEE.

  37. [37] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  38. [38] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

  39. [39] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.

  40. [40] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.

  41. [41] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.

  42. [42] Longlin Yu, Tianyu Xie, Yu Zhu, Tong Yang, Xiangyu Zhang, and Cheng Zhang. Hierarchical semi-implicit variational inference with application to diffusion model acceleration. Advances in Neural Information Processing Systems, 36:49603–49627, 2023.

  43. [43] Huangjie Zheng, Zhendong Wang, Jianbo Yuan, Guanghan Ning, Pengcheng He, Quanzeng You, Hongxia Yang, and Mingyuan Zhou. Learning stackable and skippable LEGO bricks for efficient, reconfigurable, and variable-resolution diffusion modeling. arXiv preprint arXiv:2310.06389, 2023.

  44. [44] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.

  45. [45] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. DIRE for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22445–22455, 2023.

  46. [46] Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.

  47. [48] Bowen Zheng and Tianming Yang. Diffusion models are innate one-step generators. arXiv preprint arXiv:2405.20750, 2024.

  48. [49] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

  49. [50] Zhengyang Geng, Ashwini Pokle, and J. Zico Kolter. One-step diffusion distillation via deep equilibrium models. Advances in Neural Information Processing Systems, 36:41914–41931, 2023.

  50. [51] Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, Mingyuan Zhou, et al. Patch diffusion: Faster and more data-efficient training of diffusion models. Advances in Neural Information Processing Systems, 36:72137–72154, 2023.

  51. [52] Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Lingjie Liu, and Joshua M. Susskind. BOOT: Data-free distillation of denoising diffusion models with bootstrapping. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, volume 3, 2023.

  52. [53] Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. arXiv preprint arXiv:2212.11972, 2022.

  53. [54] Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.

  54. [55] Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, pages 42390–42402. PMLR, 2023.

  55. [56] David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbott, and Eric Gu. TRACT: Denoising diffusion models with transitive closure time-distillation. arXiv preprint arXiv:2303.04248, 2023.

  56. [57] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023.

  57. [58] Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion. arXiv preprint arXiv:2310.02279, 2023.

    Multi-scale Matching:Using diverse embeddings E, EL captures distributional discrepancies at multiple scales and semantic levels, providing comprehensive coverage of the gap. Proposition 1 (Advantage over Alternatives): • vs. Regression Loss:Pure regression Lreg =E[∥G θ −f ϕ∥2] only ensures Gθ ≈f ϕ pointwise, inheriting all teacher limitations (including∆...