Improved DDIM Sampling with Moment Matching Gaussian Mixtures

Prasad Gabbur

arxiv: 2311.04938 · v5 · pith:H7OYZE3Mnew · submitted 2023-11-08 · 💻 cs.CV · cs.AI · cs.LG

Pith reviewed 2026-05-24 05:32 UTC · model grok-4.3

keywords: DDIM, Gaussian mixture model, moment matching, diffusion models, accelerated sampling, FID, IS, rectified flow

The pith

Moment-matched Gaussian mixture kernels improve DDIM sample quality at small step counts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces the Gaussian reverse kernel inside DDIM with a Gaussian mixture model whose parameters are set so that the mixture matches the first two central moments of the DDPM forward marginals. Experiments on unconditional face models, class-conditional ImageNet models, Stable Diffusion text-to-image, and rectified flow models show that this change produces equal or better FID and IS scores (lower FID, higher IS), with the largest gains at low step counts such as 10 steps. The result matters because it supplies a direct way to reduce the number of denoising steps while keeping or raising generation quality. A sympathetic reader would therefore see the GMM construction as a lightweight upgrade to an already popular accelerated sampler.

Core claim

Constraining the parameters of a Gaussian mixture model to match the first and second central moments of the DDPM forward marginals yields a reverse transition kernel that, inside the DDIM framework, produces samples whose quality is at least as good as and often better than the quality obtained with the original Gaussian kernel, with the advantage most visible when the number of sampling steps is small.

What carries the argument

The moment-constrained GMM reverse kernel that replaces the Gaussian transition operator in DDIM.
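To make the moment constraint concrete, here is a minimal one-dimensional sketch in Python. It is not the paper's construction: the function names, the choice to centre the offsets, and the use of one shared variance reduction for every component are assumptions made for illustration. The paper's actual constraints and its ways of picking offsets are not reproduced on this page.

```python
import math
import random


def moment_matched_gmm(mu, var, offsets, weights):
    """Build a 1-D GMM whose overall mean and variance equal mu and var."""
    total = sum(weights)
    pis = [w / total for w in weights]
    # First-moment constraint: centre the offsets so sum_k pi_k * delta_k = 0.
    shift = sum(p * d for p, d in zip(pis, offsets))
    deltas = [d - shift for d in offsets]
    # Variance contributed by the spread of the component means.
    spread = sum(p * d * d for p, d in zip(pis, deltas))
    if spread >= var:
        raise ValueError("offsets too large for the target variance")
    # Second-moment constraint (assumed shared reduction): shrink every
    # component variance by the spread so the mixture variance stays at var.
    comp_var = var - spread
    means = [mu + d for d in deltas]
    return pis, means, [comp_var] * len(means)


def mixture_moments(pis, means, variances):
    """Exact mean and variance of a 1-D Gaussian mixture."""
    mean = sum(p * m for p, m in zip(pis, means))
    second = sum(p * (v + m * m) for p, m, v in zip(pis, means, variances))
    return mean, second - mean * mean


def sample_gmm(pis, means, variances, rng):
    """Draw one sample: pick a component, then sample its Gaussian."""
    k = rng.choices(range(len(pis)), weights=pis, k=1)[0]
    return rng.gauss(means[k], math.sqrt(variances[k]))


# Hypothetical target moments and offsets, chosen only for illustration.
pis, means, variances = moment_matched_gmm(
    mu=0.3, var=1.0, offsets=[-0.5, 0.1, 0.7], weights=[1.0, 1.0, 1.0]
)
mean, variance = mixture_moments(pis, means, variances)
```

The mixture and a single Gaussian with the same `mu` and `var` agree on their first two moments by construction, so any difference in sample quality has to come from the mixture's higher-order structure.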

If this is right

Ten-step sampling on ImageNet 256x256 reaches FID 6.94 and IS 207.85 with the GMM kernel versus 10.15 and 196.73 with the Gaussian kernel.
Quality gains appear on CelebAHQ, FFHQ, and Stable Diffusion text-to-image models.
The same moment-matching GMM improves sampling from both 1-rectified and 2-rectified flow models.
First- and second-order moment matching alone is presented as sufficient for the observed gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Higher-order moments of the forward marginals may not be required for competitive reverse sampling.
The GMM kernel could be inserted into other non-Markovian diffusion samplers beyond DDIM.
Similar moment-matching constructions might apply to continuous-time diffusion SDEs.
Fewer sampling steps would lower the inference compute needed for high-quality image generation.

Load-bearing premise

That matching only the first and second central moments of the DDPM forward marginals is enough to obtain equal or better sample quality than the original Gaussian DDIM kernel.

What would settle it

Obtaining an FID above 10.15 or an IS below 196.73 on ImageNet 256x256 with 10-step GMM sampling would falsify the reported improvement over the Gaussian baseline.

Figures

Figures reproduced from arXiv: 2311.04938 by Prasad Gabbur.

**Figure 1.** CelebAHQ (top) and FFHQ (bottom). FID (↓). The horizontal line is the DDPM baseline run for 1000 steps.

**Figure 2.** Class-conditional ImageNet with Classifier-free Guidance. FID (↓) (top) and IS (↑) (bottom) on the 50k ImageNet validation set. Classifier-free guidance with a scale of 2.5 is used during inference. The horizontal line is the DDPM baseline run for 1000 steps.

**Figure 3.** Text-to-Image Generation. FID (↓) on a 30k subset of COYO-700M using the Stable Diffusion v2.1 model. Classifier-free guidance with a scale of 7.5 is used during inference.

**Figure 4.** Text-to-Image Generation. IS (↑) on a 30k subset of COYO-700M using the Stable Diffusion v2.1 model. Classifier-free guidance with a scale of 7.5 is used during inference.

**Figure 5.** Class-conditional ImageNet with Classifier Guidance. FID (↓) with guidance scale of 1.0 (top) and 10.0 (bottom) respectively. The horizontal line is the DDPM baseline run for 1000 steps.

**Figure 6.** Class-conditional ImageNet with Classifier Guidance. IS (↑) with guidance scale of 1.0 (top) and 10.0 (bottom) respectively. The horizontal line is the DDPM baseline run for 1000 steps.

**Figure 7.** Class-conditional ImageNet with Classifier-free Guidance. FID (↓) (top) and IS (↑) (bottom) respectively with a guidance scale of 5.0. The horizontal line is the DDPM baseline run for 1000 steps.

**Figure 8.** CelebAHQ. FID (↓). Ablations on the number of mixture components (top) and offset scaling factor s (bottom).

**Figure 9.** Class-conditional ImageNet with classifier guidance, 10 sampling steps. Random samples from the class-conditional ImageNet model using DDIM (left) and DDIM-GMM (right) sampler conditioned on the class labels pelican (top) and cairn terrier (bottom) respectively. 10 sampling steps are used for each sampler with a classifier guidance weight of 10 (η = 1).

**Figure 10.** Class-conditional ImageNet with classifier-free guidance, 10 sampling steps. Random samples from the class-conditional ImageNet model using DDIM (left) and DDIM-GMM (right) sampler conditioned on the class labels pelican (top) and cairn terrier (bottom) respectively. 10 sampling steps are used for each sampler with a classifier-free guidance weight of 5 (η = 0).

**Figure 11.** Text-to-image generation using Stable Diffusion v2.1, 10 sampling steps. Random samples from the Stable Diffusion v2.1 model using DDIM (left) and DDIM-GMM (right) sampler conditioned on text prompts, displayed below each image. 10 sampling steps are used for each sampler with a classifier-free guidance weight of 7.5 (η = 0).

Original abstract

We propose using a Gaussian Mixture Model (GMM) as reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 on COYO700M datasets respectively. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example on ImageNet 256x256, using 10 sampling steps, we achieve a FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We see improvements using both 1-rectified flow and 2-rectified flow models. Code: https://github.com/pgabbur/ddim-gmm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper replaces DDIM's Gaussian kernel with a first-and-second-moment-matched GMM and reports FID/IS gains at low step counts, but supplies no reason the extra mixture degrees of freedom should improve the reverse approximation.

The new element is the use of a GMM reverse kernel inside DDIM whose parameters are fixed by matching the known mean and variance of the DDPM forward marginals. The authors test this on unconditional CelebA-HQ and FFHQ models, class-conditional ImageNet, and Stable Diffusion, plus some rectified-flow extensions, and show the largest gains when the number of sampling steps is small (e.g., the 10-step ImageNet numbers). The code is released, which helps reproducibility checks later. That is the concrete contribution.

The central assumption is that any GMM sharing the same first two moments will produce at least as good trajectories as the Gaussian that already matches those moments exactly. Nothing in the abstract or the reported experiments explains why the higher-order differences introduced by the mixture should be helpful rather than neutral or worse in the few-step regime. The experiments lack error bars, any sweep on the number of mixture components, and any verification that the moment constraints hold in the released code. Those gaps make the result look preliminary rather than settled.

This is the sort of practical sampling tweak that people who already run diffusion models at inference time might want to try, but it is not yet strong enough to change how most groups implement DDIM. I would send it to review if the authors add at least an ablation on component count and some analysis of the approximation error, but on current evidence it is not ready for a strong citation.

Referee Report

3 major / 2 minor

Summary. The paper proposes replacing the standard Gaussian reverse transition kernel in DDIM with a Gaussian mixture model (GMM) whose component parameters are constrained to exactly match the first and second central moments of the known closed-form DDPM forward marginals. It claims that this moment-matched GMM yields equal or better sample quality than the original DDIM Gaussian kernel, with the largest gains appearing in the low-step regime; supporting evidence consists of FID and IS improvements across unconditional models on CelebA-HQ/FFHQ, class-conditional ImageNet models, Stable Diffusion v2.1 on COYO-700M, and extensions to 1- and 2-rectified flow models.

Significance. If the empirical gains are reproducible and the moment-matching construction is shown to be responsible, the result would supply a simple, training-free modification that improves few-step diffusion sampling quality. The breadth of datasets and model families tested is a positive feature; however, the absence of any argument that the GMM's higher-order moments improve (rather than degrade) the approximation to the true DDPM posterior limits the result's theoretical grounding and generality.

major comments (3)

[§3] §3 (Moment-matching construction): the paper constrains GMM parameters to reproduce the DDPM marginal mean and variance but supplies no derivation or bound demonstrating that the resulting reverse kernel is a closer proxy to p(x_{t-1}|x_t) than the matched-moment Gaussian; because a Gaussian is fully specified by its first two moments, any improvement must arise from the GMM's higher-order moments, yet no analysis addresses when those moments help versus harm the reverse-process approximation.
[§4] §4 (Experiments, ImageNet 256×256 row): the reported 10-step FID drop from 10.15 (Gaussian) to 6.94 (GMM) and IS rise from 196.73 to 207.85 are presented without error bars, multiple random seeds, or statistical significance tests, so it is impossible to determine whether the observed gains exceed run-to-run variability.
[§4.2] §4.2 (Ablations): no experiment varies the number of mixture components K or reports performance as a function of K; without this control it remains unclear whether the claimed benefit is attributable to the GMM structure itself or to a particular, unreported choice of K.

minor comments (2)

[Abstract] The abstract states that novel SDE samplers are derived for rectified-flow models, yet the main text provides only a brief description of the adaptation; a short appendix derivation would improve clarity.
[§4] Code repository is linked but the manuscript does not state which exact hyper-parameters (e.g., number of mixture components, optimizer settings for any auxiliary fitting) were used to produce the tabulated numbers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, outlining planned revisions where appropriate while defending the empirical contributions of the moment-matched GMM approach.

Referee: [§3] §3 (Moment-matching construction): the paper constrains GMM parameters to reproduce the DDPM marginal mean and variance but supplies no derivation or bound demonstrating that the resulting reverse kernel is a closer proxy to p(x_{t-1}|x_t) than the matched-moment Gaussian; because a Gaussian is fully specified by its first two moments, any improvement must arise from the GMM's higher-order moments, yet no analysis addresses when those moments help versus harm the reverse-process approximation.

Authors: We agree that the manuscript lacks a formal derivation or error bound proving the GMM kernel is a strictly closer approximation to the true DDPM posterior. The construction prioritizes exact first- and second-moment matching (which a Gaussian also satisfies), with any gains necessarily arising from higher-order moments captured by the mixture. We will add a dedicated discussion paragraph in §3 of the revision that (i) explicitly states the contribution is primarily empirical, (ii) notes that the true reverse posterior can be non-Gaussian and multimodal, and (iii) acknowledges the absence of a general bound on when higher moments help versus harm as a limitation of the current analysis. revision: partial
Referee: [§4] §4 (Experiments, ImageNet 256×256 row): the reported 10-step FID drop from 10.15 (Gaussian) to 6.94 (GMM) and IS rise from 196.73 to 207.85 are presented without error bars, multiple random seeds, or statistical significance tests, so it is impossible to determine whether the observed gains exceed run-to-run variability.

Authors: We acknowledge that single-run reporting limits interpretability. In the revised manuscript we will re-execute the ImageNet 256×256 10-step experiments across at least three independent random seeds, report mean FID/IS with standard deviations, and include a brief statistical comparison (e.g., paired t-test or overlap of confidence intervals) to confirm the observed gains are not attributable to run-to-run variability. revision: yes
Referee: [§4.2] §4.2 (Ablations): no experiment varies the number of mixture components K or reports performance as a function of K; without this control it remains unclear whether the claimed benefit is attributable to the GMM structure itself or to a particular, unreported choice of K.

Authors: We will add an ablation subsection (new §4.3) that sweeps K from 1 to 5 on the primary datasets (CelebA-HQ, ImageNet) at the low-step regimes where gains were largest, plotting FID and IS versus K. This will demonstrate that performance improves for K>1 relative to the K=1 (Gaussian) baseline and plateaus or degrades for very large K, thereby attributing the benefit to the mixture structure rather than a specific unreported K. revision: yes
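The multi-seed reporting promised in the responses above could be computed along these lines. This is a sketch using only the Python standard library; the function names are hypothetical, and it returns the paired t statistic only, since converting it to a p-value needs a Student-t distribution that the standard library does not provide.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed metric values."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds")
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(baseline, candidate):
    """t statistic of the per-seed differences (baseline - candidate).

    Both lists must be ordered by seed so that entry i of each list
    comes from the same random seed.
    """
    if len(baseline) != len(candidate):
        raise ValueError("both samplers need one score per seed")
    diffs = [b - c for b, c in zip(baseline, candidate)]
    mean_diff, sd_diff = summarize(diffs)
    if sd_diff == 0.0:
        raise ValueError("differences are constant; t statistic is undefined")
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))
```

For FID, where lower is better, a large positive statistic with `baseline` set to the Gaussian kernel and `candidate` set to the GMM kernel would favour the GMM kernel. With only three seeds the test has two degrees of freedom, so the statistic must be large before the difference can be called significant.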

Circularity Check

0 steps flagged

No significant circularity; derivation uses external DDPM moments and external metrics

The paper defines the GMM reverse kernel by constraining its parameters to match the known closed-form first and second moments of the DDPM forward marginals (a pre-existing result independent of this work). The claimed improvements are then measured by FID and IS on held-out generated samples, which are not used to set any GMM parameters. No step reduces a prediction to a fitted input by construction, no self-citation is load-bearing, and no ansatz or uniqueness claim is smuggled in. The central construction is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method assumes the forward marginal statistics are known exactly and that a GMM can be parameterized to match them while remaining a valid reverse kernel; no new entities are postulated.

axioms (1)

domain assumption: The first two moments of the DDPM forward marginal at each timestep are known in closed form and can be used to constrain GMM parameters.
Invoked to set the GMM means, covariances, and weights without additional fitting.
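For reference, the closed-form moments this axiom relies on can be written out. The first line restates the forward marginal quoted in the appendix fragments on this page, with \alpha_t as the cumulative signal coefficient. The second line is an assumption: it is one sufficient condition under which a mixture with components N(m + \delta_k, \sigma^2 I - \Delta_k) and weights \pi_k keeps the mean m and covariance \sigma^2 I of a Gaussian kernel, and it is not claimed to be the paper's exact constraint.

```latex
\[
q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\alpha_t}\, x_0,\ (1-\alpha_t)\, I\right)
\;\Longrightarrow\;
\mathbb{E}[x_t \mid x_0] = \sqrt{\alpha_t}\, x_0, \qquad
\operatorname{Cov}[x_t \mid x_0] = (1-\alpha_t)\, I
\]
\[
\sum_{k=1}^{K} \pi_k\, \delta_k = 0, \qquad
\sum_{k=1}^{K} \pi_k\, \Delta_k = \sum_{k=1}^{K} \pi_k\, \delta_k \delta_k^{\top}
\;\Longrightarrow\;
\sum_{k=1}^{K} \pi_k \left( \sigma^2 I - \Delta_k + \delta_k \delta_k^{\top} \right) = \sigma^2 I
\]
```

The second implication holds because the covariance of a mixture is the weighted sum of component covariances plus the weighted spread of the component means around the mixture mean, and that mean is m once the offsets average to zero.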

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 8 internal anchors

[1]

Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

URL https://arxiv.org/abs/2303.08797. Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations,
[2]

Building Normalizing Flows with Stochastic Interpolants

URL https://arxiv.org/abs/2209.15571. Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg,
[3]

Density estimation using real NVP

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings,
[4]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications,
[5]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual,
[6]

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458,
[7]

Gotta go fast when generating data with score-based models

Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080,
[8]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 4401–4410. Computer Vision Foundation / IEEE,
[9]

Consistency trajectory models: Learning probability flow ode trajectory of diffusion

Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. arXiv preprint arXiv:2310.02279,
[10]

Auto-encoding variational bayes

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings,
[11]

Diffwave: A versatile diffusion model for audio synthesis

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021,
[12]

Improving the training of rectified flows

Sangyun Lee, Zinan Lin, and Giulia Fanti. Improving the training of rectified flows. In Advances in Neural Information Processing Systems, volume 37, pp. 63082–63109. Curran Associates, Inc., 2024. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, Matthew Le, Brian Karrer, David Lopez-Paz, and Itai Gat. Flow matching for generative model...
[13]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927,
[14]

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095,
[15]

Understanding diffusion models: A unified perspective

Calvin Luo. Understanding diffusion models: A unified perspective. ArXiv, abs/2208.11970,
[16]

Backpropagating through fréchet inception distance

Alexander Mathiasen and Frederik Hvilshøj. Backpropagating through fréchet inception distance. arXiv preprint arXiv:2009.14075,
[17]

Non gaussian denoising diffusion models

Eliya Nachmani, Robin San Roman, and Lior Wolf. Non gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582,
[18]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748,
[19]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022,
[20]

Noise estimation for generative diffusion models

Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600,
[21]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Narain Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ArXiv, abs/2011.13456,
[22]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469,
[23]

Novel view synthesis with diffusion models

Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. ArXiv, abs/2210.04628,
[24]

LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365,
[25]

Eq. 8 follows by induction (Song et al., 2021)

A Appendix. A.1 Proof of Constraints on GMM Parameters. Our proof for the constraints in Eq. 8 follows by induction (Song et al., 2021). The marginal of x_T is already equal to the DDPM marginal at step T by definition (Eq. 19). We show below that the marginals of all the random variables x_t, t < T are Gaussian mixtures with their first and second order momen...
[26]

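Using Bayes' rule, the marginal at x_{T-1} is given by

$$
\begin{aligned}
q_{\sigma,M}(x_{T-1} \mid x_0)
&= \int_{x_T} q_{\sigma,M}(x_{T-1} \mid x_T, x_0)\, q_{\sigma,M}(x_T \mid x_0)\, dx_T \\
&= \int_{x_T} \sum_{k=1}^{K} \pi_T^k\, \mathcal{N}\!\left(\sqrt{\alpha_{T-1}}\, x_0 + \sqrt{1-\alpha_{T-1}-\sigma_T^2}\cdot\frac{x_T-\sqrt{\alpha_T}\, x_0}{\sqrt{1-\alpha_T}} + \delta_T^k,\ \sigma_T^2 I - \Delta_T^k\right) q_{\sigma,M}(x_T \mid x_0)\, dx_T \\
&= \sum_{k=1}^{K} \pi_T^k \int_{x_T} \mathcal{N}\!\left(\sqrt{\alpha_{T-1}}\, x_0 + \sqrt{1-\alpha_{T-1}-\sigma_T^2}\cdot\frac{x_T-\sqrt{\alpha_T}\, x_0}{\sqrt{1-\alpha_T}} + \delta_T^k,\ \sigma_T^2 I - \Delta_T^k\right) \mathcal{N}\!\left(\sqrt{\alpha_T}\, x_0,\ (1-\alpha_T) I\right) dx_T, \ \ldots
\end{aligned}
$$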
[27]

Note that the inference process of the proposed implicit model is non-Gaussian and non-Markovian in general and different from Gaussian diffusion

and GMM(t) are the marginal GMMs at steps t − 1 and t respectively. Note that the inference process of the proposed implicit model is non-Gaussian and non-Markovian in general and different from Gaussian diffusion. A.4 Upper Bound of ELBO using the DDIM-GMM Inference Process. In this section we provide an upper bound for the ELBO loss using the proposed ...
[28]

as a surrogate for optimization. Assuming that there is a one-to-one correspondence between the mixture components of the GMMs using the true and estimated value of x_0 above, we can use the matched bound (Hershey & Olsen, 2007; Do,
[29]

All our diffusion models are trained in the latent space of a VQVAE (Rombach et al., 2022)

A.5 Experimental Details. We provide additional details on the experiments reported in Section 5, specifically for the CelebAHQ, FFHQ and ImageNet experiments. All our diffusion models are trained in the latent space of a VQVAE (Rombach et al., 2022). The input images to the VQVAE are at a resolution of 256x256 pixels. Each of the VQVAEs are trained on...
[30]

All our diffusion models are trained with 1000 forward steps using a linear noise (β_t = 1 − α_t/α_{t−1}) schedule of [β_0 = 0.0015, β_1000 = 0.0195]

We train an f4 VQVAE (#embeddings=8192), with no attention layers, on ImageNet for 712k steps and use its latent space to train the diffusion models on FFHQ. All our diffusion models are trained with 1000 forward steps using a linear noise (β_t = 1 − α_t/α_{t−1}) schedule of [β_0 = 0.0015, β_1000 = 0.0195]. We use the U-Net architecture (Ho et al., 2020; R...
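The linear schedule described in this fragment can be sketched as follows. The endpoint values and the number of steps come from the text; the exact indexing is an assumption (here, 1000 evenly spaced values with both endpoints included), and the function names are hypothetical.

```python
def linear_beta_schedule(beta_start=0.0015, beta_end=0.0195, num_steps=1000):
    """Evenly spaced noise rates from beta_start to beta_end (inclusive)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Cumulative products alpha_t = prod_{s<=t} (1 - beta_s).

    This matches the relation beta_t = 1 - alpha_t / alpha_{t-1}
    quoted in the text, with alpha before the first step taken as 1.
    """
    alphas = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas


betas = linear_beta_schedule()
alphas = cumulative_alphas(betas)
# Variance of the forward marginal q(x_t | x_0) at each step: 1 - alpha_t.
marginal_variances = [1.0 - a for a in alphas]
```

Under this schedule `alphas` decreases monotonically towards zero, so the forward marginal at the last step is close to a unit-variance Gaussian.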
[31]

[Plot: FID versus #Steps (10 to 100), panel η = 0.0; legend: DDPM, DDIM, DDIM-GMM-RAND, DDIM-GMM-ORTHO, DDIM-GMM-ORTHO-VUB]

and poses challenges for higher order ODE solvers (Lu et al., 2023).
[32]

We use the GMM-ORTHO-VUB sampler and fix the mixture weights of the components to be uniform in all these experiments. A.8.1 Number of mixture components. We compute the FID on the validation set by choosing one of 8, 256, or 1024 components (n) at each step during sampling, using different values of η. For each choice of n, we select the scale s among (0...

[1] [1]

Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

URL https://arxiv.org/abs/2303.08797. Michael Samuel Albergo and Eric Vanden-Eijnden. Building n ormalizing ﬂows with stochastic in- terpolants. In The Eleventh International Conference on Learning Represe ntations,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Building Normalizing Flows with Stochastic Interpolants

URL https://arxiv.org/abs/2209.15571. Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Sci ence and Statistics) . Springer-Verlag, Berlin, Heidelberg,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Dens ity estimation using real NVP

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Dens ity estimation using real NVP. In 5th In- ternational Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings,

work page 2017

[4] [4]

Classiﬁer-free diﬀusion guid ance

Jonathan Ho and Tim Salimans. Classiﬁer-free diﬀusion guid ance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications ,

work page 2021

[5] [5]

Denoising diﬀusi on probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diﬀusi on probabilistic models. In Advances in Neural Information Processing Systems 33: Annual Conference on Ne ural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual ,

work page 2020

[6] [6]

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diﬀusion models. arXiv:2204.03458,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Gotta go fast when generating data with score-based models,

Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080 ,

work page arXiv

[8] [8]

A style-based gene rator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based gene rator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition , CVPR 2019, Long Beach, CA, USA, June 16-20, 2019 , pp. 4401–4410. Computer Vision Foundation / IEEE,

work page 2019

[9] [9]

Consistency traject ory models: Learning probability ﬂow ode trajectory of diﬀusion

Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata , Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency traject ory models: Learning probability ﬂow ode trajectory of diﬀusion. arXiv preprint arXiv:2310.02279 ,

work page arXiv

[10]

Auto-encoding variational Bayes

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Yoshua Bengio and Yann LeCun (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.

[11]

DiffWave: A versatile diffusion model for audio synthesis

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.

[12]

Improving the training of rectified flows

Sangyun Lee, Zinan Lin, and Giulia Fanti. Improving the training of rectified flows. In Advances in Neural Information Processing Systems, volume 37, pp. 63082–63109. Curran Associates, Inc., 2024.

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, Matthew Le, Brian Karrer, David Lopez-Paz, and Itai Gat. Flow matching for generative model...

[13]

DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927.

[14]

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095.

[15]

Understanding diffusion models: A unified perspective

Calvin Luo. Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970.

[16]

Backpropagating through Fréchet inception distance

Alexander Mathiasen and Frederik Hvilshøj. Backpropagating through Fréchet inception distance. arXiv preprint arXiv:2009.14075.

[17]

Non Gaussian denoising diffusion models

Eliya Nachmani, Robin San Roman, and Lior Wolf. Non Gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582.

[18]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748.

[19]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.

[20]

Noise estimation for generative diffusion models

Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600.

[21]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Narain Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.

[22]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469.

[23]

Novel view synthesis with diffusion models

Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628.

[24]

LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.

[25]

A Appendix

A.1 Proof of Constraints on GMM Parameters

Our proof for the constraints in Eq. 8 follows by induction (Song et al., 2021). The marginal of $x_T$ is already equal to the DDPM marginal at step $T$ by definition (Eq. 19). We show below that the marginals of all the random variables $x_t$, $t < T$, are Gaussian mixtures with their first and second order momen...

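[26]

Using Bayes' rule, the marginal at $x_{T-1}$ is given by

$$
\begin{aligned}
q_{\sigma,M}(x_{T-1} \mid x_0)
&= \int_{x_T} q_{\sigma,M}(x_{T-1} \mid x_T, x_0)\, q_{\sigma,M}(x_T \mid x_0)\, dx_T \\
&= \int_{x_T} \sum_{k=1}^{K} \pi_T^k\, \mathcal{N}\!\left(\sqrt{\alpha_{T-1}}\, x_0 + \sqrt{1 - \alpha_{T-1} - \sigma_T^2} \cdot \frac{x_T - \sqrt{\alpha_T}\, x_0}{\sqrt{1 - \alpha_T}} + \delta_T^k,\; \sigma_T^2 I - \Delta_T^k\right) q_{\sigma,M}(x_T \mid x_0)\, dx_T \\
&= \sum_{k=1}^{K} \pi_T^k \int_{x_T} \mathcal{N}\!\left(\sqrt{\alpha_{T-1}}\, x_0 + \sqrt{1 - \alpha_{T-1} - \sigma_T^2} \cdot \frac{x_T - \sqrt{\alpha_T}\, x_0}{\sqrt{1 - \alpha_T}} + \delta_T^k,\; \sigma_T^2 I - \Delta_T^k\right) \mathcal{N}\!\left(\sqrt{\alpha_T}\, x_0,\; (1 - \alpha_T) I\right) dx_T,
\end{aligned}
$$

...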
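The marginalization above can be checked numerically in the scalar case. The sketch below is illustrative only: it assumes the constraints take the form of a zero weighted mean offset ($\sum_k \pi^k \delta^k = 0$) and matched second moments ($\sum_k \pi^k \Delta^k = \sum_k \pi^k (\delta^k)^2$), which stand in for Eq. 8 since that equation is not reproduced in this excerpt. The function name `marginal_moments` and all numeric values are hypothetical.

```python
import math


def marginal_moments(x0, alpha_prev, sigma2, weights, deltas, big_deltas):
    """Mean and variance of the scalar marginal at step T-1 after integrating out x_T.

    Component k of the assumed kernel has mean offset deltas[k] and variance
    sigma2 - big_deltas[k]; x_T given x0 is N(sqrt(alpha_T) * x0, 1 - alpha_T).
    """
    if any(sigma2 - d_var <= 0.0 for d_var in big_deltas):
        raise ValueError("each component variance sigma2 - Delta_k must be positive")
    base_mean = math.sqrt(alpha_prev) * x0
    # x_T enters the kernel mean in standardized form, so the variance it
    # contributes is (1 - alpha_prev - sigma2) regardless of alpha_T.
    carried_var = 1.0 - alpha_prev - sigma2
    comp_means = [base_mean + d for d in deltas]
    comp_vars = [carried_var + (sigma2 - d_var) for d_var in big_deltas]
    mean = sum(w * m for w, m in zip(weights, comp_means))
    second_moment = sum(
        w * (v + m * m) for w, m, v in zip(weights, comp_means, comp_vars)
    )
    return mean, second_moment - mean * mean


# Hypothetical example values (not taken from the paper).
weights = [0.25, 0.25, 0.5]
deltas = [0.2, -0.4, 0.1]  # chosen so the weighted sum of offsets is zero
shared = sum(w * d * d for w, d in zip(weights, deltas))
big_deltas = [shared] * 3  # assumed second-moment constraint, shared across components

mean, var = marginal_moments(0.7, 0.5, 0.2, weights, deltas, big_deltas)
print("mixture mean:", mean, "target:", math.sqrt(0.5) * 0.7)
print("mixture variance:", var, "target:", 1.0 - 0.5)
```

Under these assumed constraints the mixture mean and variance coincide with the Gaussian marginal $\mathcal{N}(\sqrt{\alpha_{T-1}}\, x_0, (1 - \alpha_{T-1}))$; setting every $\Delta^k$ to zero instead inflates the variance by $\sum_k \pi^k (\delta^k)^2$.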

[27]

and GMM($t$) are the marginal GMMs at steps $t-1$ and $t$ respectively. Note that the inference process of the proposed implicit model is non-Gaussian and non-Markovian in general and different from Gaussian diffusion.

A.4 Upper Bound of ELBO using the DDIM-GMM Inference Process

In this section we provide an upper bound for the ELBO loss using the proposed ...

[28]

as a surrogate for optimization. Assuming that there is a one-to-one correspondence between the mixture components of the GMMs using the true and estimated value of $x_0$ above, we can use the matched bound (Hershey & Olsen, 2007; Do,

[29]

A.5 Experimental Details

We provide additional details on the experiments reported in Section 5, specifically for the CelebAHQ, FFHQ and ImageNet experiments. All our diffusion models are trained in the latent space of a VQVAE (Rombach et al., 2022). The input images to the VQVAE are at a resolution of 256x256 pixels. Each of the VQVAEs is trained on...

[30]

We train an f4 VQVAE (#embeddings=8192), with no attention layers, on ImageNet for 712k steps and use its latent space to train the diffusion models on FFHQ. All our diffusion models are trained with 1000 forward steps using a linear noise ($\beta_t = 1 - \frac{\alpha_t}{\alpha_{t-1}}$) schedule of [$\beta_0 = 0.0015$, $\beta_{1000} = 0.0195$]. We use the U-Net architecture (Ho et al., 2020; R...
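The linear schedule described above can be sketched in a few lines. This is a minimal illustration, assuming the two stated endpoints map to the first and last of the 1000 forward steps and that $\alpha_t$ denotes the cumulative product of $(1 - \beta_s)$, which is what the relation $\beta_t = 1 - \alpha_t / \alpha_{t-1}$ implies. The function names are hypothetical and not taken from the authors' code.

```python
def linear_beta_schedule(num_steps=1000, beta_start=0.0015, beta_end=0.0195):
    """Evenly spaced beta values from beta_start to beta_end (inclusive)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta), i.e. alpha_t in the cumulative convention."""
    alphas = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas


betas = linear_beta_schedule()
alphas = cumulative_alphas(betas)

# Recover beta_t from consecutive alphas via beta_t = 1 - alpha_t / alpha_{t-1},
# taking alpha before the first step to be 1.
recovered = [1.0 - alphas[0]] + [
    1.0 - alphas[t] / alphas[t - 1] for t in range(1, len(alphas))
]
print("first/last beta:", betas[0], betas[-1])
print("final cumulative alpha:", alphas[-1])
```

The recovered values agree with the original betas up to floating-point error, which is a quick consistency check on the indexing convention chosen here.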

[31]

and poses challenges for higher order ODE solvers (Lu et al., 2023).

[Figure: FID versus #Steps panels comparing DDPM, DDIM, DDIM-GMM-RAND, DDIM-GMM-ORTHO, and DDIM-GMM-ORTHO-VUB; a panel label reads η = 0.0.]

[32]

We use the GMM-ORTHO-VUB sampler and fix the mixture weights of the components to be uniform in all these experiments.

A.8.1 Number of mixture components

We compute the FID on the validation set by choosing one of 8, 256, or 1024 components ($n$) at each step during sampling, using different values of $\eta$. For each choice of $n$, we select the scale $s$ among (0...