From Score Matching to Diffusion: A Fine-Grained Error Analysis in the Gaussian Setting

Gabriel Peyré; Matthieu Terris; Samuel Hurault; Thomas Moreau

arxiv: 2503.11615 · v3 · pith:ZWFSTZQZnew · submitted 2025-03-14 · 💻 cs.LG · math.OC

Pith reviewed 2026-05-23 00:05 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords: score matching · diffusion sampling · Wasserstein distance · Gaussian distribution · power spectrum · error analysis · sampling error

The pith

Wasserstein sampling error in the Gaussian case equals a kernel-type norm of the data power spectrum whose kernel depends on all method parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives an exact expression for the total Wasserstein error when a diffusion sampler is applied after score matching, restricted to the case where the target distribution is Gaussian. The contributions from generalization error, optimization error, discretization error, and minimal noise amplitude combine into one expression that applies a parameter-dependent kernel to the power spectrum of the data. This makes the interaction between data anisotropy and algorithmic choices fully explicit. A reader would care because the formula replaces opaque bounds with a concrete object that can be inspected frequency by frequency as sample size, step sizes, and noise level vary.

Core claim

In the Gaussian setting the Wasserstein sampling error that arises from the four error sources can be expressed exactly as a kernel-type norm of the data power spectrum, where the specific kernel is determined by the number of initial samples, the step sizes used in score matching and in the diffusion process, and the minimal noise amplitude.

What carries the argument

Kernel-type norm of the data power spectrum whose kernel encodes the combined effects of score-matching and diffusion parameters.

If this is right

Each eigenvalue of the covariance (each frequency in the power spectrum) contributes to the total error with a weight fixed by the chosen parameters.
Increasing the number of samples shrinks the generalization term inside the kernel in a frequency-dependent way.
Step-size and noise-amplitude choices trade off discretization error against the noise term inside the same kernel.
The overall error can be minimized by selecting parameters that make the integrated kernel smallest for a given power spectrum.
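
The frequency-by-frequency reading above can be illustrated with a simpler textbook case that is not the paper's kernel: the unadjusted Langevin algorithm run with the exact score of a centered Gaussian. In the covariance eigenbasis each coordinate evolves independently, and for step size `gamma < 2 * lam` the stationary variance attached to eigenvalue `lam` is `lam / (1 - gamma / (2 * lam))`. The sketch below computes the resulting squared W2 error per eigenvalue. The function names and the example spectrum are illustrative assumptions.

```python
import numpy as np

def ula_stationary_variance(lam, gamma):
    """Stationary variance of x <- x - gamma * x / lam + sqrt(2 * gamma) * noise."""
    lam = np.asarray(lam, dtype=float)
    if np.any(gamma >= 2 * lam):
        raise ValueError("need gamma < 2 * min(lam) for a stable chain")
    return lam / (1.0 - gamma / (2.0 * lam))

def per_frequency_w2_sq(lam, gamma):
    """Squared W2 error per eigenvalue between N(0, v) and N(0, lam)."""
    lam = np.asarray(lam, dtype=float)
    v = ula_stationary_variance(lam, gamma)
    # For Gaussians sharing an eigenbasis, W2^2 is a sum of (sqrt(v) - sqrt(lam))^2.
    return (np.sqrt(v) - np.sqrt(lam)) ** 2

# Hypothetical anisotropic spectrum, from low to high frequency.
spectrum = np.array([4.0, 1.0, 0.25])
contrib = per_frequency_w2_sq(spectrum, gamma=0.1)
total_w2_sq = contrib.sum()
print(contrib, total_w2_sq)
```

In this toy case the smallest eigenvalues contribute most to the discretization bias, which is the kind of frequency-dependent weighting the kernel formulation is meant to expose.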

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same kernel perspective might be used to compare different diffusion schedules without new derivations.
For data that is only approximately Gaussian the leading error term would still be governed by the same power-spectrum kernel.
The analysis isolates which parameter most strongly damps high-frequency components of the spectrum.

Load-bearing premise

The target distribution is exactly Gaussian so that power-spectrum interactions admit closed-form expressions.

What would settle it

Pick a concrete covariance matrix, fix all algorithm parameters, run the full score-matching-plus-diffusion procedure on samples from that Gaussian, compute the empirical Wasserstein distance to the target, and check whether the numerical value equals the kernel-norm prediction.
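
The measurement step of such a check can be sketched with the standard closed form for the W2 distance between two Gaussians, W2² = ‖m_a − m_b‖² + Tr(Σ_a + Σ_b − 2 (Σ_a^{1/2} Σ_b Σ_a^{1/2})^{1/2}). The sketch below is a minimal illustration under stated assumptions: it draws samples directly from the target instead of running score matching and diffusion, and it does not reproduce the paper's kernel-norm prediction. The names `psd_sqrt` and `gaussian_w2` are hypothetical helpers.

```python
import numpy as np

def psd_sqrt(mat):
    """Square root of a symmetric positive semi-definite matrix."""
    mat = 0.5 * (mat + mat.T)            # remove round-off asymmetry
    w, u = np.linalg.eigh(mat)
    w = np.clip(w, 0.0, None)            # guard against tiny negative eigenvalues
    return (u * np.sqrt(w)) @ u.T

def gaussian_w2(mean_a, cov_a, mean_b, cov_b):
    """W2 distance between N(mean_a, cov_a) and N(mean_b, cov_b)."""
    root_a = psd_sqrt(cov_a)
    cross = psd_sqrt(root_a @ cov_b @ root_a)
    sq = np.sum((mean_a - mean_b) ** 2) + np.trace(cov_a + cov_b - 2.0 * cross)
    return float(np.sqrt(max(sq, 0.0)))

# Stand-in for the sampler output: samples drawn directly from the target.
rng = np.random.default_rng(0)
cov_target = np.diag([4.0, 1.0, 0.25])
samples = rng.multivariate_normal(np.zeros(3), cov_target, size=20000)
cov_hat = np.cov(samples, rowvar=False)
w2_empirical = gaussian_w2(np.zeros(3), cov_target, samples.mean(axis=0), cov_hat)
print(w2_empirical)
```

In an actual test, `samples` would come from the full pipeline, and `w2_empirical` would be compared against the value predicted by the paper's formula for the same parameters.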

Figures

Figures reproduced from arXiv: 2503.11615 by Gabriel Peyré, Matthieu Terris, Samuel Hurault, Thomas Moreau.

**Figure 1.** W2 error, with respect to τ (top) and N (bottom), between the theoretical (Theorem 1) and the empirical Gaussian approximations of the stationary distribution of the SGD algorithm (11).

**Figure 2.** Theoretical vs. empirical Langevin sampling W2 error as a function of σ, for various score training parameters τ (top) and N (bottom).

**Figure 3.** Theoretical and empirical W2 errors of discretized diffusion at step k, with respect to T − t_k, with constant stepsize γ and (exponentially) decreasing stepsize γ_k.

**Figure 4.** Power spectrum of C_data for different values of the power law coefficient ζ.

**Figure 6.** Evolution of the optimal noise level parameter.

**Figure 7.** Evolution of the optimal stopping time r_k^* = T − t_k^* with respect to the SGD learning rate τ and the data power law coefficient ζ.

Original abstract

Sampling from an unknown distribution, accessible only through discrete samples, is a fundamental problem at the core of generative AI. The current state-of-the-art methods follow a two-step process: first, estimating the score function (the gradient of a smoothed log-distribution) and then applying a diffusion-based sampling algorithm -- such as Langevin or Diffusion models. The resulting distribution's correctness can be impacted by four major factors: the generalization and optimization errors in score matching, and the discretization and minimal noise amplitude in the diffusion. In this paper, we make the sampling error explicit when using a diffusion sampler in the Gaussian setting. We provide a sharp analysis of the Wasserstein sampling error that arises from these four error sources. This allows us to rigorously track how the anisotropy of the data distribution (encoded by its power spectrum) interacts with key parameters of the end-to-end sampling method, including the number of initial samples, the stepsizes in both score matching and diffusion, and the noise amplitude. Notably, we show that the Wasserstein sampling error can be expressed as a kernel-type norm of the data power spectrum, where the specific kernel depends on the method parameters. This result provides a foundation for further analysis of the tradeoffs involved in optimizing sampling accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main advance is an explicit closed-form kernel-norm expression for the total Wasserstein error under exact Gaussian targets.

The paper derives a closed-form expression for the end-to-end Wasserstein sampling error when the target is Gaussian. It decomposes the error into four pieces—generalization and optimization from score matching, plus discretization and minimal noise from the diffusion sampler—then shows the sum equals a kernel-type norm applied to the data power spectrum, with the kernel fixed by the method parameters such as stepsizes and noise amplitude. This makes the interaction between anisotropy and those parameters concrete and trackable. The Gaussian setting is stated up front as the condition that permits the closed forms, so the scope is clear. Within that scope the result looks like a genuine sharpening over earlier diffusion error analyses, because it supplies an explicit handle rather than bounds or asymptotics. The main limitation is the restriction to Gaussians; nothing in the claim suggests it extends immediately, and the paper does not pretend otherwise. The stress-test note found no internal inconsistencies or unsupported steps in the stated goal, which aligns with the abstract's description. For readers working on theoretical error analysis in score-based models, the explicit form could be useful for studying parameter tradeoffs. It is narrow enough that most people outside that subfield will not need it, but the derivation is precise enough to merit checking the details. I would send it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper analyzes sampling error for diffusion-based methods (score matching followed by Langevin or diffusion sampling) when the target is exactly Gaussian. It decomposes the end-to-end Wasserstein error into four explicit sources—score-matching generalization, score-matching optimization, diffusion discretization, and minimal-noise amplitude—and derives a closed-form expression showing that the total error equals a kernel norm of the data power spectrum, with the kernel determined by the method parameters (sample size, step sizes, noise level).

Significance. If the derivations hold, the result supplies an exact, non-asymptotic characterization of how data anisotropy interacts with every tunable parameter in the pipeline. This is a concrete advance over generic convergence bounds because it is fully explicit, parameter-free within the Gaussian setting, and directly falsifiable by computing the kernel norm on any given covariance. The Gaussian case is a standard benchmark; the closed-form therefore offers a rigorous foundation for studying parameter trade-offs that more general analyses cannot yet reach.

major comments (2)

§3–4 (four-error decomposition): the claim that the four sources are exhaustive and combine additively in Wasserstein distance requires an explicit triangle-inequality argument or coupling construction; without the precise statement of how the score error propagates through the diffusion SDE, it is unclear whether cross terms vanish or are absorbed into the kernel.
Theorem 1 (kernel-norm expression): the derivation that the Wasserstein error reduces exactly to ∫ K(λ) |μ̂(λ)|^2 dλ (where K depends on step sizes and noise) must be checked for the precise definition of the power spectrum μ̂ and for whether the minimal-noise term is treated as an additive bias or folded into the kernel; any hidden dependence on the unknown covariance would contradict the “parameter-free” characterization of the result.

minor comments (2)

Notation for the power spectrum and the four kernels should be introduced once in a single table or definition block rather than redefined inline in each subsection.
[Introduction] The abstract states the result for “the Gaussian setting” but the introduction should explicitly list the four assumptions (exact Gaussianity, known functional form of the score, etc.) that enable the closed forms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and the recommendation for minor revision. The comments help strengthen the presentation of our results. We address each major comment below and plan to incorporate the suggested clarifications.

Referee: §3–4 (four-error decomposition): the claim that the four sources are exhaustive and combine additively in Wasserstein distance requires an explicit triangle-inequality argument or coupling construction; without the precise statement of how the score error propagates through the diffusion SDE, it is unclear whether cross terms vanish or are absorbed into the kernel.

Authors: We thank the referee for highlighting this point. The four-error decomposition in Sections 3 and 4 is obtained by applying the triangle inequality to the Wasserstein distance between the true distribution and the output of the composed pipeline (score matching followed by diffusion sampling). In the Gaussian setting, the diffusion SDE is linear, which allows us to propagate the score error explicitly through the Ornstein-Uhlenbeck process; any cross terms are absorbed into the resulting kernel instead of being left as residuals. We will add an explicit coupling construction and a supporting lemma in the revised manuscript to clarify this propagation.
Revision: yes
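
The linearity invoked in this response can be illustrated on a standard Ornstein-Uhlenbeck forward process, dX = −X dt + √2 dW, started from N(0, diag(λ)). This is a sketch under assumptions: the paper's exact process and time parametrization may differ, and `ou_spectrum` and `score_slope` are hypothetical names. Each eigenvalue evolves independently as e^{−2t} λ + (1 − e^{−2t}), so the marginal stays Gaussian and its score stays linear in x.

```python
import numpy as np

def ou_spectrum(lam, t):
    """Eigenvalues of Cov(X_t) for dX = -X dt + sqrt(2) dW, X_0 ~ N(0, diag(lam))."""
    lam = np.asarray(lam, dtype=float)
    decay = np.exp(-2.0 * t)
    return decay * lam + (1.0 - decay)

def score_slope(lam, t):
    """Per-coordinate slope of the score x -> -x / v of N(0, diag(v))."""
    return -1.0 / ou_spectrum(lam, t)

# Monte Carlo check using the exact transition X_t = e^{-t} X_0 + sqrt(1 - e^{-2t}) Z.
rng = np.random.default_rng(0)
lam0 = np.array([4.0, 1.0, 0.25])   # hypothetical data spectrum
t = 0.5
x0 = rng.standard_normal((100_000, 3)) * np.sqrt(lam0)
z = rng.standard_normal(x0.shape)
xt = np.exp(-t) * x0 + np.sqrt(1.0 - np.exp(-2.0 * t)) * z
empirical_var = xt.var(axis=0)
print(empirical_var, ou_spectrum(lam0, t))
```

Because every quantity is diagonal in the covariance eigenbasis, an error in the learned score slope can be tracked one eigenvalue at a time.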

Referee: Theorem 1 (kernel-norm expression): the derivation that the Wasserstein error reduces exactly to ∫ K(λ) |μ̂(λ)|^2 dλ (where K depends on step sizes and noise) must be checked for the precise definition of the power spectrum μ̂ and for whether the minimal-noise term is treated as an additive bias or folded into the kernel; any hidden dependence on the unknown covariance would contradict the “parameter-free” characterization of the result.

Authors: The power spectrum μ̂ is precisely the set of eigenvalues of the covariance matrix, i.e., if Σ = U D U^T then μ̂(λ_i) = D_ii for the corresponding frequencies λ_i. The minimal-noise amplitude contributes an additive term to the kernel, K(λ) = K_score(λ) + K_diff(λ) + K_noise(λ), where K_noise(λ) depends only on the noise level and step sizes, not on the covariance. Thus the full expression depends on the data only through its power spectrum, consistent with the parameter-free claim (meaning independent of other aspects of the distribution). We will include the exact definition of μ̂ and the decomposition of K in the revision for clarity.
Revision: yes
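
The definition Σ = U D U^T used in this response can be sketched numerically. The example below builds a covariance with a known spectrum in a random orthogonal basis, then recovers the spectrum with `numpy.linalg.eigh`. The helper name `power_spectrum` and the example values are assumptions made for illustration.

```python
import numpy as np

def power_spectrum(cov):
    """Eigenvalues (decreasing) and matching eigenvectors of a symmetric covariance."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
d_true = np.array([4.0, 1.0, 0.25])               # hypothetical spectrum D_ii
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal basis
sigma = (q * d_true) @ q.T                        # Sigma = Q diag(d_true) Q^T

spectrum, basis = power_spectrum(sigma)
reconstructed = (basis * spectrum) @ basis.T      # U D U^T
print(spectrum)
```

The recovered spectrum does not depend on the choice of `q`, which is the sense in which a quantity can depend on the data only through its power spectrum.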

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained under stated Gaussian assumption

The paper states up front that it restricts to the exact-Gaussian setting to obtain closed-form expressions for the four-error decomposition of Wasserstein sampling error. The central result expresses this error as a kernel norm on the data power spectrum, with the kernel depending on method parameters; this is presented as an explicit algebraic derivation rather than a fit, a redefinition, or a self-citation chain. No load-bearing step reduces to an input by construction, no fitted quantity is relabeled as a prediction, and no uniqueness theorem or ansatz is imported via self-citation. The derivation chain therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the domain assumption that the target is Gaussian, which supplies the power-spectrum representation used to obtain the kernel expression; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: the target distribution is exactly Gaussian.
Invoked to obtain closed-form power-spectrum interactions and an explicit Wasserstein error.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Geometry-Aware Discretization Error of Diffusion Models
cs.LG · 2026-05 · unverdicted · novelty 7.0

First-order asymptotic expansions of weak and Fréchet discretization errors in diffusion sampling are derived, explicit under Gaussian data through covariance geometry and robust to other data geometries.

On The Hidden Biases of Flow Matching Samplers
stat.ML · 2025-12 · unverdicted · novelty 7.0

Empirical flow matching introduces coupled biases from plug-in estimation, including altered statistical targets, non-gradient minimizers, and non-unique dynamics via flux-null fields, with base distribution controlli...


[1] [1]

Non-strongly-convex smooth stochastic approximation with convergence rate o (1/n)

Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate o (1/n). In Advances in Neural Information Processing Systems, 2013

work page 2013

[2] [2]

Nearly d-linear convergence bounds for diffusion models via stochastic localization

Joe Benton, Valentin De Bortoli, Arnaud Doucet, and George Deligiannidis. Nearly d-linear convergence bounds for diffusion models via stochastic localization. International Conference on Learning Representations, 2024

work page 2024

[3] [3]

Dynamical regimes of diffusion models

Giulio Biroli, Tony Bonnaire, Valentin de Bortoli, and Marc Mézard. Dynamical regimes of diffusion models. Nature Communications, 15(1):9957, November 2024

work page 2024

[4] [4]

Generative modeling with denoising auto-encoders and langevin sampling

Adam Block, Youssef Mroueh, and Alexander Rakhlin. Generative modeling with denoising auto-encoders and langevin sampling. arXiv preprint arXiv:2002.00107, 2020

work page arXiv 2002

[5] [5]

Optimization methods for large-scale machine learning

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018

work page 2018

[6] [6]

Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions

Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. In International Conference on Machine Learning, pages 4735–4763. PMLR, 2023

work page 2023

[7] [7]

Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. International Conference on Learning Representations, 2023

work page 2023

[8] [8]

Theoretical guarantees for approximate sampling from smooth and log- concave densities

Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log- concave densities. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(3):651–676, 2017

work page 2017

[9] [9]

User-friendly guarantees for the langevin monte carlo with inaccurate gradient

Arnak S Dalalyan and Avetik Karagulyan. User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019

work page 2019

[10] [10]

Bounding the error of dis- cretized langevin algorithms for non-strongly log-concave targets

Arnak S Dalalyan, Avetik Karagulyan, and Lionel Riou-Durand. Bounding the error of dis- cretized langevin algorithms for non-strongly log-concave targets. Journal of Machine Learning Research, 23(235):1–38, 2022

work page 2022

[11] [11]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems, 2021

work page 2021

[12] [12]

Bridging the gap between constant step size stochastic gradient descent and markov chains

Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and markov chains. Annals of Statistics, 48:1348–1382, 2020

work page 2020

[13] [13]

Analysis of langevin monte carlo via convex optimization

Alain Durmus, Szymon Majewski, and Bła˙zej Miasojedow. Analysis of langevin monte carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019

work page 2019

[14] [14]

Nonasymptotic convergence analysis for the unadjusted Langevin algorithm

Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551 – 1587, 2017

work page 2017

[15] [15]

Reflection couplings and contraction rates for diffusions

Andreas Eberle. Reflection couplings and contraction rates for diffusions. Probability theory and related fields, 166:851–886, 2016

work page 2016

[16] [16]

Tweedie’s formula and selection bias

Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011

work page 2011

[17] [17]

Improved sample complexity bounds for diffusion model training

Shivam Gupta, Aditya Parulekar, Eric Price, and Zhiyang Xun. Improved sample complexity bounds for diffusion model training. In Advances in Neural Information Processing Systems, 2024

[18]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020

[19]

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In Advances in Neural Information Processing Systems, 2022

[20]

Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005

[21]

Zahra Kadkhodaie, Florentin Guth, Eero P Simoncelli, and Stéphane Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representations. arXiv preprint arXiv:2310.02557, 2023

[22]

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022

[23]

Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024

[24]

Paul Langevin et al. Sur la théorie du mouvement brownien. CR Acad. Sci. Paris, 146:530–533, 1908

[25]

Rémi Laumont, Valentin De Bortoli, Andrés Almansa, Julie Delon, Alain Durmus, and Marcelo Pereyra. Bayesian imaging using plug & play priors: when langevin meets tweedie. SIAM Journal on Imaging Sciences, 15(2):701–737, 2022

[26]

Gen Li, Yuting Wei, Yuxin Chen, and Yuejie Chi. Towards faster non-asymptotic convergence for diffusion-based generative models. arXiv preprint arXiv:2306.09251, 2023

[27]

Ruilin Li, Hongyuan Zha, and Molei Tao. Sqrt(d) dimension dependence of langevin monte carlo. In International Conference on Learning Representations, 2022

[28]

Xiang Li, Yixiang Dai, and Qing Qu. Understanding generalizability of diffusion models requires rethinking the hidden gaussian structure. Advances in neural information processing systems, 37:57499–57538, 2024

[29]

Luigi Malago, Luigi Montrucchio, and Giovanni Pistone. Wasserstein riemannian geometry of positive definite matrices. arXiv preprint arXiv:1801.09269, 2018

[30]

Stephan Mandt, Matthew Hoffman, and David Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354–363. PMLR, 2016

[31]

Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012

[32]

Emile Pierret and Bruno Galerne. Diffusion models for gaussian distributions: Exact solutions and wasserstein errors. arXiv preprint arXiv:2405.14250, 2024

[33]

Gareth O Roberts and Richard L Tweedie. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996

[34]

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021

[35]

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, 2019

[36]

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

[37]

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011

[38]

Binxu Wang and John J Vastola. The hidden linear structure in score-based models and its application. arXiv preprint arXiv:2311.10892, 2023

[39]

Binxu Wang and John J Vastola. The unreasonable effectiveness of gaussian score approximation for diffusion models and its applications. arXiv preprint arXiv:2412.09726, 2024

[40]

Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093–3027. PMLR, 2018

[41]

Yu-Han Wu, Pierre Marion, Gérard Biau, and Claire Boyer. Taking a big step: Large learning rates in denoising score matching prevent memorization. arXiv preprint arXiv:2502.03435, 2025

[42]

Zhenyu Zhu, Francesco Locatello, and Volkan Cevher. Sample complexity bounds for score-matching: Causal discovery and generative modeling. In Advances in Neural Information Processing Systems, 2023.

A Denoising Score Matching error

A.1 Proof of Theorem 1

Theorem 1 (Optimization and generalization errors in SGD for denoising score matching). Under Ass...


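$+\, o(\varepsilon^2)$ (220)

We thus get

$$X_1 = L^{-1}_{H_0^{1/2}}[H_1], \tag{221}$$

$$X_2 = -L^{-1}_{H_0^{1/2}}[X_1^2]. \tag{222}$$

Applying this result in (217) with $H_0 \to \Sigma^{1/2} H_0 \Sigma^{1/2}$ and $H_1 \to \Sigma^{1/2} H_1 \Sigma^{1/2}$, we get

$$
\begin{aligned}
B^2(\Sigma, H_0 + \varepsilon H_1) = \mathrm{Tr}\Big( \Sigma + H_0 + \varepsilon H_1 - 2\Big[ & (\Sigma^{1/2} H_0 \Sigma^{1/2})^{1/2} + \varepsilon\, L^{-1}_{(\Sigma^{1/2} H_0 \Sigma^{1/2})^{1/2}}\big[\Sigma^{1/2} H_1 \Sigma^{1/2}\big] \\
& - \varepsilon^2\, L^{-1}_{(\Sigma^{1/2} H_0 \Sigma^{1/2})^{1/2}}\Big[ \big( L^{-1}_{(\Sigma^{1/2} H_0 \Sigma^{1/2})^{1/2}}\big[\Sigma^{1/2} H_1 \Sigma^{1/2}\big] \big)^2 \Big] \Big] \Big) + o(\varepsilon^2),
\end{aligned}
\tag{223}
$$

which simplifies to the desired result. The rest o...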
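The second-order expansion of the matrix square root in (220)-(222) can be checked numerically. The sketch below is illustrative only and rests on an assumption not visible in this excerpt: that $L_A[X] = AX + XA$ is the Lyapunov operator, so that $L_A^{-1}[C]$ solves $AX + XA = C$. The helper names `spd_sqrt`, `lyap_inv`, `sqrt_expansion` and `expansion_error`, as well as the random test matrices, are hypothetical and not taken from the paper.

```python
import numpy as np

def spd_sqrt(H):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, U = np.linalg.eigh(H)
    return (U * np.sqrt(w)) @ U.T

def lyap_inv(A, C):
    """Solve A X + X A = C for symmetric positive definite A.

    Assumed definition of L_A^{-1}[C]; solved in the eigenbasis of A.
    """
    a, U = np.linalg.eigh(A)
    C_tilde = U.T @ C @ U
    return U @ (C_tilde / (a[:, None] + a[None, :])) @ U.T

def sqrt_expansion(H0, H1, eps):
    """Second-order approximation of (H0 + eps * H1)^{1/2}."""
    R = spd_sqrt(H0)
    X1 = lyap_inv(R, H1)          # first-order term, as in (221)
    X2 = -lyap_inv(R, X1 @ X1)    # second-order term, as in (222)
    return R + eps * X1 + eps**2 * X2

# Hypothetical test data: a well-conditioned SPD H0 and a symmetric H1.
rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
H0 = A @ A.T + d * np.eye(d)
B = rng.standard_normal((d, d))
H1 = (B + B.T) / 2

def expansion_error(eps):
    """Frobenius distance between the exact root and its expansion."""
    exact = spd_sqrt(H0 + eps * H1)
    return np.linalg.norm(exact - sqrt_expansion(H0, H1, eps))

err_small = expansion_error(1e-2)
err_large = expansion_error(1e-1)
print(err_small, err_large, err_large / err_small)
```

If the assumed definition of $L_A$ is the right one, the remainder is of order $\varepsilon^3$, so multiplying $\varepsilon$ by 10 should multiply the error by roughly 1000.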
