arxiv: 2311.04938 · v5 · pith:H7OYZE3Mnew · submitted 2023-11-08 · 💻 cs.CV · cs.AI · cs.LG

Improved DDIM Sampling with Moment Matching Gaussian Mixtures

Pith reviewed 2026-05-24 05:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords: DDIM, Gaussian mixture model, moment matching, diffusion models, accelerated sampling, FID, IS, rectified flow

The pith

Moment-matched Gaussian mixture kernels improve DDIM sample quality at small step counts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces the Gaussian reverse kernel inside DDIM with a Gaussian mixture model whose parameters are set so that the mixture matches the first two central moments of the DDPM forward marginals. Experiments on unconditional face models, class-conditional ImageNet models, Stable Diffusion text-to-image, and rectified flow models show that this change produces equal or better FID (lower is better) and IS (higher is better), with the largest gains at low step counts such as 10 steps. The result matters because it offers a direct way to reduce the number of denoising steps while keeping or raising generation quality. A sympathetic reader would therefore see the GMM construction as a lightweight upgrade to an already popular accelerated sampler.

Core claim

Constraining the parameters of a Gaussian mixture model to match the first and second central moments of the DDPM forward marginals yields a reverse transition kernel that, inside the DDIM framework, produces samples whose quality is at least as good as and often better than the quality obtained with the original Gaussian kernel, with the advantage most visible when the number of sampling steps is small.

What carries the argument

The moment-constrained GMM reverse kernel that replaces the Gaussian transition operator in DDIM.
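
As a rough illustration of what swapping the kernel means, the sketch below draws one reverse step from a mixture whose components share the usual DDIM mean, add a per-component offset, and use a reduced per-component variance. It assumes the kernel form that appears in the paper's appendix, with component means equal to the DDIM mean plus δ^k and covariances σ²I − Δ^k, restricted here to diagonal Δ^k. The function name `ddim_gmm_step`, its arguments, and the use of a supplied `x0_hat` in place of a network prediction are hypothetical choices for this example, not the author's code.

```python
import numpy as np

def ddim_gmm_step(rng, x_t, x0_hat, alpha_t, alpha_prev, sigma,
                  weights, offsets, delta_var):
    """One reverse step x_t -> x_{t-1} with a Gaussian-mixture kernel (sketch).

    alpha_t, alpha_prev: cumulative signal coefficients at the two steps.
    weights[k], offsets[k], delta_var[k]: mixture weight, mean offset and
    diagonal variance reduction of component k (all hypothetical inputs).
    """
    x_t = np.asarray(x_t, dtype=float)
    x0_hat = np.asarray(x0_hat, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    delta_var = np.asarray(delta_var, dtype=float)

    # Noise implied by the current sample and the predicted clean sample.
    eps_hat = (x_t - np.sqrt(alpha_t) * x0_hat) / np.sqrt(1.0 - alpha_t)
    # Standard DDIM mean, shared by every mixture component.
    ddim_mean = (np.sqrt(alpha_prev) * x0_hat
                 + np.sqrt(1.0 - alpha_prev - sigma**2) * eps_hat)

    # Pick a component, then sample from that component's Gaussian.
    k = rng.choice(len(weights), p=weights)
    comp_var = sigma**2 - delta_var[k]
    if np.any(comp_var < 0.0):
        raise ValueError("delta_var must not exceed sigma**2")
    noise = rng.standard_normal(x_t.shape)
    return ddim_mean + offsets[k] + np.sqrt(comp_var) * noise
```

With a single component, zero offset and zero `delta_var`, the step reduces to the ordinary DDIM update, which is a convenient sanity check.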

If this is right

  • Ten-step sampling on ImageNet 256x256 reaches FID 6.94 and IS 207.85 with the GMM kernel versus 10.15 and 196.73 with the Gaussian kernel.
  • Quality gains appear on CelebAHQ, FFHQ, and Stable Diffusion text-to-image models.
  • The same moment-matching GMM improves sampling from both 1-rectified and 2-rectified flow models.
  • First- and second-order moment matching alone is presented as sufficient for the observed gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Higher-order moments of the forward marginals may not be required for competitive reverse sampling.
  • The GMM kernel could be inserted into other non-Markovian diffusion samplers beyond DDIM.
  • Similar moment-matching constructions might apply to continuous-time diffusion SDEs.
  • Fewer sampling steps would lower the inference compute needed for high-quality image generation.

Load-bearing premise

That matching only the first and second central moments of the DDPM forward marginals is enough to obtain equal or better sample quality than the original Gaussian DDIM kernel.
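
To make the premise concrete, here is a minimal one-dimensional sketch of what matching the first two moments asks of a mixture: the weighted offsets must sum to zero, and the within-component variance must be reduced by whatever spread the offsets introduce. The helper names and the choice of a single shared component variance are simplifying assumptions for this example; the paper's own parameterization may differ.

```python
import numpy as np

def moment_matched_gmm(target_mean, target_var, offsets, weights):
    """Return (weights, means, comp_var) of a 1-D GMM with the target moments."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    offsets = np.asarray(offsets, dtype=float)
    # Centre the offsets so the mixture mean equals target_mean.
    offsets = offsets - np.sum(weights * offsets)
    # Variance contributed by the spread of the component means.
    spread = np.sum(weights * offsets**2)
    if spread >= target_var:
        raise ValueError("offsets are too large for the target variance")
    # The shared component variance absorbs the remainder.
    return weights, target_mean + offsets, target_var - spread

def mixture_moments(weights, means, comp_var):
    """Mean and variance of a 1-D GMM with a shared component variance."""
    mean = np.sum(weights * means)
    var = comp_var + np.sum(weights * (means - mean) ** 2)
    return mean, var
```

Calling `mixture_moments` on the output of `moment_matched_gmm` should return the target mean and variance up to floating-point error, whatever offsets and weights are supplied.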

What would settle it

Obtaining an FID above 10.15 or an IS below 196.73 on ImageNet 256x256 with 10-step GMM sampling would falsify the reported improvement over the Gaussian baseline.

Figures

Figures reproduced from arXiv: 2311.04938 by Prasad Gabbur.

Figure 1: CelebAHQ (top) and FFHQ (bottom). FID (↓). The horizontal line is the DDPM baseline run for 1000 steps.
Figure 2: Class-conditional ImageNet with Classifier-free Guidance. FID (↓) (top) and IS (↑) (bottom) on the 50k ImageNet validation set. Classifier-free guidance with a scale of 2.5 is used during inference. The horizontal line is the DDPM baseline run for 1000 steps.
Figure 3: Text-to-Image Generation. FID (↓) on a 30k subset of COYO-700M using the Stable Diffusion v2.1 model. Classifier-free guidance with a scale of 7.5 is used during inference.
Figure 4: Text-to-Image Generation. IS (↑) on a 30k subset of COYO-700M using the Stable Diffusion v2.1 model. Classifier-free guidance with a scale of 7.5 is used during inference.
Figure 5: Class-conditional ImageNet with Classifier Guidance. FID (↓) with guidance scale of 1.0 (top) and 10.0 (bottom) respectively. The horizontal line is the DDPM baseline run for 1000 steps.
Figure 6: Class-conditional ImageNet with Classifier Guidance. IS (↑) with guidance scale of 1.0 (top) and 10.0 (bottom) respectively. The horizontal line is the DDPM baseline run for 1000 steps.
Figure 7: Class-conditional ImageNet with Classifier-free Guidance. FID (↓) (top) and IS (↑) (bottom) respectively with a guidance scale of 5.0. The horizontal line is the DDPM baseline run for 1000 steps.
Figure 8: CelebAHQ. FID (↓). Ablations on the number of mixture components (top) and offset scaling factor s (bottom).
Figure 9: Class-conditional ImageNet with classifier guidance, 10 sampling steps. Random samples from the class-conditional ImageNet model using DDIM (left) and DDIM-GMM (right) sampler conditioned on the class labels pelican (top) and cairn terrier (bottom) respectively. 10 sampling steps are used for each sampler with a classifier guidance weight of 10 (η = 1).
Figure 10: Class-conditional ImageNet with classifier-free guidance, 10 sampling steps. Random samples from the class-conditional ImageNet model using DDIM (left) and DDIM-GMM (right) sampler conditioned on the class labels pelican (top) and cairn terrier (bottom) respectively. 10 sampling steps are used for each sampler with a classifier-free guidance weight of 5 (η = 0).
Figure 11: Text-to-image generation using Stable Diffusion v2.1, 10 sampling steps. Random samples from the Stable Diffusion v2.1 model using DDIM (left) and DDIM-GMM (right) sampler conditioned on text prompts, displayed below each image. 10 sampling steps are used for each sampler with a classifier-free guidance weight of 7.5 (η = 0).
Original abstract

We propose using a Gaussian Mixture Model (GMM) as reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 on COYO700M datasets respectively. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example on ImageNet 256x256, using 10 sampling steps, we achieve a FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We see improvements using both 1-rectified flow and 2-rectified flow models. Code: https://github.com/pgabbur/ddim-gmm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes replacing the standard Gaussian reverse transition kernel in DDIM with a Gaussian mixture model (GMM) whose component parameters are constrained to exactly match the first and second central moments of the known closed-form DDPM forward marginals. It claims that this moment-matched GMM yields equal or better sample quality than the original DDIM Gaussian kernel, with the largest gains appearing in the low-step regime; supporting evidence consists of FID and IS improvements across unconditional models on CelebA-HQ/FFHQ, class-conditional ImageNet models, Stable Diffusion v2.1 on COYO-700M, and extensions to 1- and 2-rectified flow models.

Significance. If the empirical gains are reproducible and the moment-matching construction is shown to be responsible, the result would supply a simple, training-free modification that improves few-step diffusion sampling quality. The breadth of datasets and model families tested is a positive feature; however, the absence of any argument that the GMM's higher-order moments improve (rather than degrade) the approximation to the true DDPM posterior limits the result's theoretical grounding and generality.

major comments (3)
  1. [§3] §3 (Moment-matching construction): the paper constrains GMM parameters to reproduce the DDPM marginal mean and variance but supplies no derivation or bound demonstrating that the resulting reverse kernel is a closer proxy to p(x_{t-1}|x_t) than the matched-moment Gaussian; because a Gaussian is fully specified by its first two moments, any improvement must arise from the GMM's higher-order moments, yet no analysis addresses when those moments help versus harm the reverse-process approximation.
  2. [§4] §4 (Experiments, ImageNet 256×256 row): the reported 10-step FID drop from 10.15 (Gaussian) to 6.94 (GMM) and IS rise from 196.73 to 207.85 are presented without error bars, multiple random seeds, or statistical significance tests, so it is impossible to determine whether the observed gains exceed run-to-run variability.
  3. [§4.2] §4.2 (Ablations): no experiment varies the number of mixture components K or reports performance as a function of K; without this control it remains unclear whether the claimed benefit is attributable to the GMM structure itself or to a particular, unreported choice of K.
minor comments (2)
  1. [Abstract] The abstract states that novel SDE samplers are derived for rectified-flow models, yet the main text provides only a brief description of the adaptation; a short appendix derivation would improve clarity.
  2. [§4] Code repository is linked but the manuscript does not state which exact hyper-parameters (e.g., number of mixture components, optimizer settings for any auxiliary fitting) were used to produce the tabulated numbers.
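
The first major comment turns on the fact that two distributions can share a mean and a variance yet differ in their higher moments. A small sketch, using the closed-form fourth central moment of a one-dimensional Gaussian mixture, shows how a moment-matched mixture departs from the Gaussian's excess kurtosis of zero. The function name and any parameters passed to it are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def gmm_excess_kurtosis(weights, means, variances):
    """Excess kurtosis of a 1-D Gaussian mixture, computed analytically."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = np.asarray(means, dtype=float)
    v = np.asarray(variances, dtype=float)

    mix_mean = np.sum(w * mu)
    d = mu - mix_mean                      # component offsets from the mixture mean
    mix_var = np.sum(w * (v + d**2))       # law of total variance
    # Fourth central moment of N(mu_k, v_k) about mix_mean: d^4 + 6 d^2 v + 3 v^2.
    m4 = np.sum(w * (d**4 + 6.0 * d**2 * v + 3.0 * v**2))
    return m4 / mix_var**2 - 3.0
```

For a symmetric two-component mixture with offsets ±d and unit total variance (so each component has variance 1 − d²), this evaluates to −2d⁴. In that special case the matched mixture always has negative excess kurtosis, which is exactly the kind of higher-moment difference the comment asks the authors to analyze.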

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, outlining planned revisions where appropriate while defending the empirical contributions of the moment-matched GMM approach.

Point-by-point responses
  1. Referee: [§3] §3 (Moment-matching construction): the paper constrains GMM parameters to reproduce the DDPM marginal mean and variance but supplies no derivation or bound demonstrating that the resulting reverse kernel is a closer proxy to p(x_{t-1}|x_t) than the matched-moment Gaussian; because a Gaussian is fully specified by its first two moments, any improvement must arise from the GMM's higher-order moments, yet no analysis addresses when those moments help versus harm the reverse-process approximation.

    Authors: We agree that the manuscript lacks a formal derivation or error bound proving the GMM kernel is a strictly closer approximation to the true DDPM posterior. The construction prioritizes exact first- and second-moment matching (which a Gaussian also satisfies), with any gains necessarily arising from higher-order moments captured by the mixture. We will add a dedicated discussion paragraph in §3 of the revision that (i) explicitly states the contribution is primarily empirical, (ii) notes that the true reverse posterior can be non-Gaussian and multimodal, and (iii) acknowledges the absence of a general bound on when higher moments help versus harm as a limitation of the current analysis. revision: partial

  2. Referee: [§4] §4 (Experiments, ImageNet 256×256 row): the reported 10-step FID drop from 10.15 (Gaussian) to 6.94 (GMM) and IS rise from 196.73 to 207.85 are presented without error bars, multiple random seeds, or statistical significance tests, so it is impossible to determine whether the observed gains exceed run-to-run variability.

    Authors: We acknowledge that single-run reporting limits interpretability. In the revised manuscript we will re-execute the ImageNet 256×256 10-step experiments across at least three independent random seeds, report mean FID/IS with standard deviations, and include a brief statistical comparison (e.g., paired t-test or overlap of confidence intervals) to confirm the observed gains are not attributable to run-to-run variability. revision: yes

  3. Referee: [§4.2] §4.2 (Ablations): no experiment varies the number of mixture components K or reports performance as a function of K; without this control it remains unclear whether the claimed benefit is attributable to the GMM structure itself or to a particular, unreported choice of K.

    Authors: We will add an ablation subsection (new §4.3) that sweeps K from 1 to 5 on the primary datasets (CelebA-HQ, ImageNet) at the low-step regimes where gains were largest, plotting FID and IS versus K. This will demonstrate that performance improves for K>1 relative to the K=1 (Gaussian) baseline and plateaus or degrades for very large K, thereby attributing the benefit to the mixture structure rather than a specific unreported K. revision: yes
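
The multi-seed reporting promised in the second response could be summarized along the following lines. This is a sketch under stated assumptions: the function name, the dictionary keys, and the choice of a hand-computed paired t statistic are hypothetical, and no scores from the paper are used or implied.

```python
import numpy as np

def paired_seed_summary(baseline_scores, candidate_scores):
    """Mean, sample std and paired t statistic for per-seed metric values.

    Both inputs hold one score per random seed, in the same seed order.
    """
    b = np.asarray(baseline_scores, dtype=float)
    c = np.asarray(candidate_scores, dtype=float)
    if b.ndim != 1 or b.shape != c.shape or b.size < 2:
        raise ValueError("need two equal-length 1-D arrays with at least 2 seeds")

    diff = c - b                       # per-seed difference (candidate - baseline)
    sd = diff.std(ddof=1)              # sample standard deviation of the differences
    # The paired t statistic is undefined when every difference is identical.
    t_stat = diff.mean() / (sd / np.sqrt(diff.size)) if sd > 0 else float("nan")
    return {
        "baseline_mean": b.mean(), "baseline_std": b.std(ddof=1),
        "candidate_mean": c.mean(), "candidate_std": c.std(ddof=1),
        "mean_diff": diff.mean(), "t_stat": t_stat, "dof": diff.size - 1,
    }
```

The t statistic would then be compared against a Student-t distribution with `dof` degrees of freedom. With only three seeds that leaves two degrees of freedom, so the test has limited power and the per-seed values are worth reporting alongside it.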

Circularity Check

0 steps flagged

No significant circularity; derivation uses external DDPM moments and external metrics

full rationale

The paper defines the GMM reverse kernel by constraining its parameters to match the known closed-form first and second moments of the DDPM forward marginals (a pre-existing result independent of this work). The claimed improvements are then measured by FID and IS on held-out generated samples, which are not used to set any GMM parameters. No step reduces a prediction to a fitted input by construction, no self-citation is load-bearing, and no ansatz or uniqueness claim is smuggled in. The central construction is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method assumes the forward marginal statistics are known exactly and that a GMM can be parameterized to match them while remaining a valid reverse kernel; no new entities are postulated.

axioms (1)
  • domain assumption: The first two moments of the DDPM forward marginal at each timestep are known in closed form and can be used to constrain GMM parameters.
    Invoked to set the GMM means, covariances, and weights without additional fitting.
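
The closed-form moments this axiom refers to are the mean and variance of the DDPM forward marginal, which in the paper's notation is N(√α_t x_0, (1 − α_t) I), with α_t the cumulative product of the per-step factors 1 − β_t. The sketch below computes them for a linear β schedule. The endpoints 0.0015 and 0.0195 and the 1000 steps follow the experimental details given in the paper, while the function name and the zero-based step index are assumptions of this example.

```python
import numpy as np

def forward_marginal_moments(x0, t, betas):
    """Mean and (isotropic) variance of q(x_t | x_0) for a given beta schedule."""
    # Cumulative signal coefficient alpha_t = prod_{s <= t} (1 - beta_s).
    alpha_bar = np.cumprod(1.0 - np.asarray(betas, dtype=float))
    a_t = alpha_bar[t]
    mean = np.sqrt(a_t) * np.asarray(x0, dtype=float)
    var = 1.0 - a_t
    return mean, var

# Linear schedule with the endpoints reported in the paper's experimental details.
betas = np.linspace(0.0015, 0.0195, 1000)
```

These two quantities are the only targets the GMM parameters are constrained against, which is why the ledger counts no free parameters.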


