Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access

Amrit Singh Bedi; Mudit Gaur; Prashant Trivedi; Sasidhar Kunapuli; Vaneet Aggarwal

arxiv: 2505.18344 · v7 · submitted 2025-05-23 · 💻 cs.LG · cs.AI· stat.ML

Improved Sample Complexity For Diffusion Model Training Without Empirical Risk Minimizer Access

Mudit Gaur , Prashant Trivedi , Sasidhar Kunapuli , Amrit Singh Bedi , Vaneet Aggarwal This is my paper

Pith reviewed 2026-05-19 12:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords diffusion modelsscore estimationsample complexityempirical risk minimizationgenerative modelsneural network optimizationstatistical bounds

0 comments

The pith

Diffusion model score estimation achieves sample complexity O(ε^{-4}) without access to the empirical risk minimizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes how many data samples are needed to train diffusion models by estimating the score function that guides the reverse diffusion process. It shows that the total error can be split into three parts—statistical error from finite samples, approximation error from the neural network class, and optimization error from inexact training—and that each part can be controlled separately. This split removes the exponential dependence on network size that appeared in earlier proofs and eliminates the need to assume that training exactly solves the empirical risk minimization problem. A reader would care because the resulting bound scales only as the fourth power of the inverse accuracy and applies under conditions closer to actual training practice.

Core claim

By decomposing the score estimation error into statistical, approximation, and optimization errors that can be bounded independently, the analysis yields a sample complexity of O(ε^{-4}) for diffusion model training. This bound holds without assuming that the optimization procedure reaches the exact empirical risk minimizer of the score estimation loss and without incurring exponential dependence on the parameters of the neural network used to represent the score.

What carries the argument

structured decomposition of the score estimation error into statistical, approximation, and optimization errors

If this is right

The number of samples required grows only polynomially with the inverse of the target accuracy.
Optimization routines that stop short of the global minimum of the empirical loss still suffice for the theoretical guarantee.
The bound applies without an exponential penalty as the neural network width or depth grows.
Theoretical sample requirements no longer rely on the unrealistic premise of perfect empirical risk minimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same error-splitting technique could be applied to analyze sample needs in other score-based generative approaches such as continuous normalizing flows.
It may be possible to combine this bound with existing discretization-error analyses to obtain end-to-end guarantees for the full sampling procedure.
Practitioners could use the explicit dependence on optimization quality to decide when further training iterations are no longer worth additional samples.
The decomposition suggests a route toward analyses that remain valid even when the score network is trained with stochastic gradient methods that never reach exact minima.

Load-bearing premise

The three error terms arising from the decomposition can be bounded independently without reintroducing exponential dependence on the neural network parameters.

What would settle it

An experiment or calculation in which increasing the number of training samples fails to reduce the total score estimation error at the predicted O(ε^{-4}) rate once optimization is allowed to remain inexact would falsify the bound.

read the original abstract

Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of $\mathcal{O}(\epsilon^{-4})$. Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result that achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gets O(ε^{-4}) sample complexity for diffusion score estimation without the ERM assumption by decomposing the error into three terms, but the optimization piece may still carry exponential network-parameter dependence.

read the letter

The main point is that this work removes the unrealistic ERM access assumption from prior diffusion sample-complexity analyses and still reaches an O(ε^{-4}) bound. That is the concrete advance they advertise over earlier results that either had worse dimension dependence or leaned on an oracle for the score loss minimizer. The approach is a decomposition of total score error into statistical, approximation, and optimization pieces, which they say lets them control each term separately and drop the exponential blow-up in network parameters that showed up before. If the bounds on those three terms really stay polynomial, the claim would be a modest but useful tightening of the theory. The abstract is clear about the positioning, and the decomposition itself is a standard and reasonable way to organize the analysis. Credit for trying to make the assumptions more realistic than the ERM oracle. The soft spot is the optimization error term. Standard neural-net generalization arguments (covering numbers, Rademacher complexity) typically reintroduce factors exponential in depth or width when you bound how well gradient descent approximates the minimizer. The stress-test note flags exactly this risk, and nothing in the abstract or the reader's summary shows an explicit derivation that avoids it. If the paper uses a non-standard rate or an infinite-width limit to control that term, it needs to be stated up front; otherwise the overall O(ε^{-4}) may not hold for the networks people actually train. This is for readers who follow theoretical analyses of score matching and generative models. Someone who wants to see how sample-efficiency bounds change once the ERM crutch is removed will find the setup useful, even if they have to verify the optimization step themselves. The paper is coherent enough on its own terms to deserve a serious referee who can check the full proof of the optimization bound. I would send it to review rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The manuscript provides a theoretical analysis of score estimation in diffusion models. It decomposes the estimation error into statistical, approximation, and optimization components to derive a sample complexity bound of O(ε^{-4}), eliminating exponential dependence on neural network parameters and avoiding any assumption of access to the empirical risk minimizer of the score loss.

Significance. If the central bounds hold, the result would be a meaningful contribution to the theory of diffusion models. It improves on prior analyses that either exhibited poor dimension scaling or required unrealistic access to exact ERM solutions. The structured error decomposition is a clear strength and could support more efficient training analyses and practical guidelines.

major comments (2)

[§3] §3 (Error Decomposition): The claim that statistical, approximation, and optimization errors can be bounded independently without exponential dependence on network parameters is load-bearing for the O(ε^{-4}) result. Standard NN generalization tools (Rademacher complexity or covering numbers) typically reintroduce exp(O(depth·width)) factors in the optimization term; the manuscript must explicitly derive or cite the technique (e.g., parameter-independent convergence rate or infinite-width assumption) that prevents this coupling.
[Main Theorem] Main Theorem: The final sample-complexity statement assumes the three error terms remain separable under the stated network class and score-function assumptions. If the optimization error bound implicitly depends on the same parameter count that appears in the approximation error, the claimed improvement over prior work would not hold uniformly; please add the precise assumptions on the score function and network class that guarantee independence.

minor comments (1)

[Abstract] Abstract: A one-sentence outline of the three error terms would make the contribution more immediately accessible to readers unfamiliar with the decomposition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments correctly identify that the independence of the three error terms is central to our O(ε^{-4}) bound. We address each point below and will revise the manuscript to improve clarity and explicitness without altering the core claims.

read point-by-point responses

Referee: [§3] §3 (Error Decomposition): The claim that statistical, approximation, and optimization errors can be bounded independently without exponential dependence on network parameters is load-bearing for the O(ε^{-4}) result. Standard NN generalization tools (Rademacher complexity or covering numbers) typically reintroduce exp(O(depth·width)) factors in the optimization term; the manuscript must explicitly derive or cite the technique (e.g., parameter-independent convergence rate or infinite-width assumption) that prevents this coupling.

Authors: We agree that explicit justification is necessary. Our optimization error bound (Lemma 3.2) relies on the smoothness and strong convexity of the population score-matching loss under the forward diffusion process, which yields a gradient-descent convergence rate whose leading term depends only on the Lipschitz constant of the score and the step-size schedule, not on network width or depth. This rate is derived in the appendix by controlling the gradient norm via the diffusion variance schedule rather than via uniform convergence over the parameter space. We will move this derivation into the main text of §3, add a short paragraph contrasting it with standard Rademacher-based bounds, and cite the relevant parameter-independent optimization literature for overparameterized models. revision: yes
Referee: [Main Theorem] Main Theorem: The final sample-complexity statement assumes the three error terms remain separable under the stated network class and score-function assumptions. If the optimization error bound implicitly depends on the same parameter count that appears in the approximation error, the claimed improvement over prior work would not hold uniformly; please add the precise assumptions on the score function and network class that guarantee independence.

Authors: The separability follows from two sets of assumptions already present in the manuscript but not sufficiently highlighted in the theorem statement. Assumption 2.1 requires the target score to be Lipschitz with bounded second moments under the data distribution; Assumption 3.1 requires the neural network class to have a Lipschitz constant that is controlled independently of the number of parameters (via a bounded-norm constraint on the weights). Under these conditions the approximation error scales with network capacity while the optimization error is governed solely by the loss curvature induced by the diffusion process. We will restate the main theorem with an explicit list of these assumptions and a one-paragraph explanation of why they decouple the three terms. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via standard error bounds

full rationale

The paper derives the O(ε^{-4}) sample complexity bound through a structured decomposition of score estimation error into statistical, approximation, and optimization components. This decomposition is justified using standard concentration inequalities and approximation theory results that are independent of the target bound and do not reduce to fitted quantities or self-citations by construction. No load-bearing step equates a prediction to its own inputs, and the elimination of exponential NN-parameter dependence follows from the separation of error terms rather than redefinition. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on standard assumptions about the smoothness of the score function and the approximation power of neural networks, plus the ability to separately bound the three error components without cross terms that reintroduce exponential dependence.

axioms (1)

domain assumption The score function satisfies sufficient regularity conditions (e.g., Lipschitz or smoothness) to allow separate bounding of statistical, approximation, and optimization errors.
Invoked to enable the structured decomposition described in the abstract.

pith-pipeline@v0.9.0 · 5670 in / 1298 out tokens · 37093 ms · 2026-05-19T12:32:45.014036+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1]

Kernel ridge regression with predicted feature inputs and applications to factor-based nonparametric regression.arXiv preprint arXiv:2505.20022,

Xin Bing, Xin He, and Chao Wang. Kernel ridge regression with predicted feature inputs and applications to factor-based nonparametric regression.arXiv preprint arXiv:2505.20022,

work page arXiv
[2]

Generative modeling with denoising auto-encoders and langevin sampling.arXiv preprint arXiv:2002.00107,

Adam Block, Youssef Mroueh, and Alexander Rakhlin. Generative modeling with denoising auto-encoders and langevin sampling.arXiv preprint arXiv:2002.00107,

work page arXiv 2002
[3]

Adaptive guidance: Training-free acceleration of conditional diffusion models

Angela Castillo, Jonas Kohler, Juan C Pérez, Juan Pablo Pérez, Albert Pumarola, Bernard Ghanem, Pablo Arbeláez, and Ali Thabet. Adaptive guidance: Training-free acceleration of conditional diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 1962–1970,

work page 1962
[4]

An overview of diffusion models: Applications, guided gen- eration, statistical rates and optimization.arXiv preprint arXiv:2404.07771, 2024

Minshuo Chen, Song Mei, Jianqing Fan, and Mengdi Wang. An overview of diffusion models: Applications, guided generation, statistical rates and optimization.arXiv preprint arXiv:2404.07771,

work page arXiv
[5]

Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. InNeurIPS 2022 Workshop on Score-Based Methods,

work page 2022
[6]

Alessandro Favero, Antonio Sclocchi, and Matthieu Wyart

doi: 10.1109/ACCESS.2018.2890583. Alessandro Favero, Antonio Sclocchi, and Matthieu Wyart. Bigger isn’t always memorizing: Early stopping overparameterized diffusion models.arXiv preprint arXiv:2505.16959,

work page doi:10.1109/access.2018.2890583 2018
[7]

& Chen, M

Hengyu Fu, Zhuoran Yang, Mengdi Wang, and Minshuo Chen. Unveil conditional diffusion models with classifier-free guidance: A sharp statistical theory.arXiv preprint arXiv:2403.11968,

work page arXiv
[8]

A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach

13 Published in Transactions on Machine Learning Research (04/2026) Swetha Ganesh, Washim Uddin Mondal, and Vaneet Aggarwal. A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach. InForty-second International Conference on Machine Learning,

work page 2026
[9]

Improved sample complexity bounds for diffusion model training

Shivam Gupta, Aditya Parulekar, Eric Price, and Zhiyang Xun. Improved sample complexity bounds for diffusion model training. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems (updated version at arXiv:2311.13745v4, Nov 2025),

work page arXiv 2025
[10]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Fastdiff: A fast conditional diffusion model for high-quality speech synthesis.arXiv preprint arXiv:2204.09934,

Rongjie Huang, Max WY Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. Fastdiff: A fast conditional diffusion model for high-quality speech synthesis.arXiv preprint arXiv:2204.09934,

work page arXiv
[12]

Accelerating convergence of score-based diffusion models, provably

Gen Li, Yu Huang, Timofey Efimov, Yuting Wei, Yuejie Chi, and Yuxin Chen. Accelerating convergence of score-based diffusion models, provably. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, NuriaOliver, JonathanScarlett, andFelixBerkenkamp(eds.),Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProce...

work page arXiv
[13]

Low-dimensional adaptation of diffusion models: Conver- gence in total variation

14 Published in Transactions on Machine Learning Research (04/2026) Jiadong Liang, Zhihan Huang, and Yuxin Chen. Low-dimensional adaptation of diffusion models: Conver- gence in total variation. InThe Thirty Eighth Annual Conference on Learning Theory, pp. 3723–3729. PMLR, 2025a. Yuchen Liang, Peizhong Ju, Yingbin Liang, and Ness Shroff. Broadening target...

work page 2026
[14]

Accelerating diffusion models via early stop of the diffusion process.arXiv preprint arXiv:2205.12524,

Zhaoyang Lyu, Xudong Xu, Ceyuan Yang, Dahua Lin, and Bo Dai. Accelerating diffusion models via early stop of the diffusion process.arXiv preprint arXiv:2205.12524,

work page arXiv
[15]

Pac-bayesian risk bounds for fully connected deep neural network with gaussian priors.arXiv preprint arXiv:2505.04341,

The Tien Mai. Pac-bayesian risk bounds for fully connected deep neural network with gaussian priors.arXiv preprint arXiv:2505.04341,

work page arXiv
[16]

Order-optimal sample complexity of rectified flows

Hari Krishna Sahoo, Mudit Gaur, and Vaneet Aggarwal. Order-optimal sample complexity of rectified flows. arXiv preprint arXiv:2601.20250,

work page arXiv
[17]

Score-based generative modeling through stochastic differential equations

15 Published in Transactions on Machine Learning Research (04/2026) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. InInternational Conference on Learning Representations,

work page 2026
[18]

Efficient diffusion models for vision: A survey.arXiv preprint arXiv:2210.09292,

Anwaar Ulhaq and Naveed Akhtar. Efficient diffusion models for vision: A survey.arXiv preprint arXiv:2210.09292,

work page arXiv
[19]

Kaihong Zhang, Heqi Yin, Feng Liang, and Jingbo Liu

URLhttps://arxiv.org/abs/2402.01965. Kaihong Zhang, Heqi Yin, Feng Liang, and Jingbo Liu. Minimax optimality of score-based diffusion mod- els: Beyond the density lower bound assumptions. InForty-first International Conference on Machine Learning,

work page arXiv
[20]

arXiv preprint arXiv:2311.01223 , year=

Zhengbang Zhu, Hanye Zhao, Haoran He, Yichao Zhong, Shenyu Zhang, Haoquan Guo, Tingting Chen, and Weinan Zhang. Diffusion models for reinforcement learning: A survey.arXiv preprint arXiv:2311.01223,

work page arXiv
[21]

16 Published in Transactions on Machine Learning Research (04/2026) A Note about (Gupta et al.,

work page 2026
[22]

B Score Estimation Algorithm In this section, we provide a detailed description of the algorithm used for estimating the score function in diffusion models

We note that the comparison in this paper uses the results from the updated arXiv version of their work (in Nov 2025), which incorporates a correction to an error identified in the earlier version of our work (available at arXiv). B Score Estimation Algorithm In this section, we provide a detailed description of the algorithm used for estimating the score...

work page 2025
[23]

The corresponding empirical loss is defined as: ˆLk(θ) =1 n n∑ i=1 ∥sθ(xi,tk)−∇logptk(xi)∥2.(49) 17 Published in Transactions on Machine Learning Research (04/2026) Letθa k andθb k be the minimizers ofLk(θ)and ˆLt(θ), respectively, corresponding to score functionssa tk and sb tk. By the definitions of minimizers, we can write Lk(θb k)−Lk(θa k)≤Lk(θb k)−Lk...

work page 2026
[24]

We can apply Lemma E.5 to bound the empirical Rademacher complexity ˆR(θ)in terms of the Rademacher complexityR(θ)of the function classΘ′′

we have with probability at least1−δ, ⏐⏐⏐L′ k(θ)−ˆL′k(θ) ⏐⏐⏐≤ˆR(θ) +O   √ log 1 δ n  ,∀θ∈Θ′′.(57) SinceΘ ′′={θa,θb}is a finite class (just two functions). We can apply Lemma E.5 to bound the empirical Rademacher complexity ˆR(θ)in terms of the Rademacher complexityR(θ)of the function classΘ′′. Since ˆR(θ) =1 m Eσ [ maxθ∈Θ′′ ∑n i=1f(θ)σi ] , applying L...

work page 2026
[25]

21 Published in Transactions on Machine Learning Research (04/2026) We get Equation (86) from Equation (85) from Assumption 1, which implies that the second moment ofz is bounded

We get Equation (86) from Equation (85) from Assumption 1 and by using the upper bound on the Mill’s ration which implies thatϕ(κ) 1−Φ(κ)≤κ+1 κ. 21 Published in Transactions on Machine Learning Research (04/2026) We get Equation (86) from Equation (85) from Assumption 1, which implies that the second moment ofz is bounded. d∑ j=1 Exj∼(utk)j   |x|2 0...

work page 2026
[26]

23 Published in Transactions on Machine Learning Research (04/2026) Combining the last three displays, E[L(θi+1)|θi]≤L(θi)−η∥∇L(θi)∥2 2 + Lη2 2 ( ∥∇L(θi)∥2 2 + σ2 bi )

Moreover, E [ ∥gi∥2 2|θi ] =E [ ∥∇L(θi) + (gi−∇L(θi))∥2 2|θi ] =∥∇L(θi)∥2 2 +E [ ∥gi−∇L(θi)∥2 2|θi ] ≤∥∇L(θi)∥2 2 + σ2 bi . 23 Published in Transactions on Machine Learning Research (04/2026) Combining the last three displays, E[L(θi+1)|θi]≤L(θi)−η∥∇L(θi)∥2 2 + Lη2 2 ( ∥∇L(θi)∥2 2 + σ2 bi ) . Rearranging and usingη≤1/L(so thatη−Lη2 2 ≥η/2), E[L(θi+1)|θi]≤...

work page 2026
[27]

Substituting this into (108) gives E [ L(θT )−L⋆] ≤(1−ηµ)T−1∆ 1 +O ( 1√n )

2 ≥βT2 2 , 24 Published in Transactions on Machine Learning Research (04/2026) hence 1 T−1≤2 T ≤2 √ 2 βn=O ( 1√n ) . Substituting this into (108) gives E [ L(θT )−L⋆] ≤(1−ηµ)T−1∆ 1 +O ( 1√n ) . In particular, for sufficiently largen(so the transient term is negligible), E [ L(θT )−L⋆] =O ( 1√n ) . Note thatˆstk and ˆθk denote our estimate of the loss func...

work page 2026
[28]

Proof.Leth 0 =x, and forℓ= 0,...,L−1define the layer recursion hℓ+1=σ(Wℓhℓ+bℓ), whereW ℓ∈Rnℓ+1×nℓ,b ℓ∈Rnℓ+1, andnℓ≤Wfor hidden layers

We get the second inequality by denotingLas the number of layers in the neural network,WandBa constant such all parameters of the neural network upper bounded byB. Proof.Leth 0 =x, and forℓ= 0,...,L−1define the layer recursion hℓ+1=σ(Wℓhℓ+bℓ), whereW ℓ∈Rnℓ+1×nℓ,b ℓ∈Rnℓ+1, andnℓ≤Wfor hidden layers. We work with theℓ∞operator norm: ∥Wℓ∥∞= max i ∑ j |(Wℓ)ij|...

work page 2026

[1] [1]

Kernel ridge regression with predicted feature inputs and applications to factor-based nonparametric regression.arXiv preprint arXiv:2505.20022,

Xin Bing, Xin He, and Chao Wang. Kernel ridge regression with predicted feature inputs and applications to factor-based nonparametric regression.arXiv preprint arXiv:2505.20022,

work page arXiv

[2] [2]

Generative modeling with denoising auto-encoders and langevin sampling.arXiv preprint arXiv:2002.00107,

Adam Block, Youssef Mroueh, and Alexander Rakhlin. Generative modeling with denoising auto-encoders and langevin sampling.arXiv preprint arXiv:2002.00107,

work page arXiv 2002

[3] [3]

Adaptive guidance: Training-free acceleration of conditional diffusion models

Angela Castillo, Jonas Kohler, Juan C Pérez, Juan Pablo Pérez, Albert Pumarola, Bernard Ghanem, Pablo Arbeláez, and Ali Thabet. Adaptive guidance: Training-free acceleration of conditional diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 1962–1970,

work page 1962

[4] [4]

An overview of diffusion models: Applications, guided gen- eration, statistical rates and optimization.arXiv preprint arXiv:2404.07771, 2024

Minshuo Chen, Song Mei, Jianqing Fan, and Mengdi Wang. An overview of diffusion models: Applications, guided generation, statistical rates and optimization.arXiv preprint arXiv:2404.07771,

work page arXiv

[5] [5]

Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. InNeurIPS 2022 Workshop on Score-Based Methods,

work page 2022

[6] [6]

Alessandro Favero, Antonio Sclocchi, and Matthieu Wyart

doi: 10.1109/ACCESS.2018.2890583. Alessandro Favero, Antonio Sclocchi, and Matthieu Wyart. Bigger isn’t always memorizing: Early stopping overparameterized diffusion models.arXiv preprint arXiv:2505.16959,

work page doi:10.1109/access.2018.2890583 2018

[7] [7]

& Chen, M

Hengyu Fu, Zhuoran Yang, Mengdi Wang, and Minshuo Chen. Unveil conditional diffusion models with classifier-free guidance: A sharp statistical theory.arXiv preprint arXiv:2403.11968,

work page arXiv

[8] [8]

A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach

13 Published in Transactions on Machine Learning Research (04/2026) Swetha Ganesh, Washim Uddin Mondal, and Vaneet Aggarwal. A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach. InForty-second International Conference on Machine Learning,

work page 2026

[9] [9]

Improved sample complexity bounds for diffusion model training

Shivam Gupta, Aditya Parulekar, Eric Price, and Zhiyang Xun. Improved sample complexity bounds for diffusion model training. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems (updated version at arXiv:2311.13745v4, Nov 2025),

work page arXiv 2025

[10] [10]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Fastdiff: A fast conditional diffusion model for high-quality speech synthesis.arXiv preprint arXiv:2204.09934,

Rongjie Huang, Max WY Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. Fastdiff: A fast conditional diffusion model for high-quality speech synthesis.arXiv preprint arXiv:2204.09934,

work page arXiv

[12] [12]

Accelerating convergence of score-based diffusion models, provably

Gen Li, Yu Huang, Timofey Efimov, Yuting Wei, Yuejie Chi, and Yuxin Chen. Accelerating convergence of score-based diffusion models, provably. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, NuriaOliver, JonathanScarlett, andFelixBerkenkamp(eds.),Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProce...

work page arXiv

[13] [13]

Low-dimensional adaptation of diffusion models: Conver- gence in total variation

14 Published in Transactions on Machine Learning Research (04/2026) Jiadong Liang, Zhihan Huang, and Yuxin Chen. Low-dimensional adaptation of diffusion models: Conver- gence in total variation. InThe Thirty Eighth Annual Conference on Learning Theory, pp. 3723–3729. PMLR, 2025a. Yuchen Liang, Peizhong Ju, Yingbin Liang, and Ness Shroff. Broadening target...

work page 2026

[14] [14]

Accelerating diffusion models via early stop of the diffusion process.arXiv preprint arXiv:2205.12524,

Zhaoyang Lyu, Xudong Xu, Ceyuan Yang, Dahua Lin, and Bo Dai. Accelerating diffusion models via early stop of the diffusion process.arXiv preprint arXiv:2205.12524,

work page arXiv

[15] [15]

Pac-bayesian risk bounds for fully connected deep neural network with gaussian priors.arXiv preprint arXiv:2505.04341,

The Tien Mai. Pac-bayesian risk bounds for fully connected deep neural network with gaussian priors.arXiv preprint arXiv:2505.04341,

work page arXiv

[16] [16]

Order-optimal sample complexity of rectified flows

Hari Krishna Sahoo, Mudit Gaur, and Vaneet Aggarwal. Order-optimal sample complexity of rectified flows. arXiv preprint arXiv:2601.20250,

work page arXiv

[17] [17]

Score-based generative modeling through stochastic differential equations

15 Published in Transactions on Machine Learning Research (04/2026) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. InInternational Conference on Learning Representations,

work page 2026

[18] [18]

Efficient diffusion models for vision: A survey.arXiv preprint arXiv:2210.09292,

Anwaar Ulhaq and Naveed Akhtar. Efficient diffusion models for vision: A survey.arXiv preprint arXiv:2210.09292,

work page arXiv

[19] [19]

Kaihong Zhang, Heqi Yin, Feng Liang, and Jingbo Liu

URLhttps://arxiv.org/abs/2402.01965. Kaihong Zhang, Heqi Yin, Feng Liang, and Jingbo Liu. Minimax optimality of score-based diffusion mod- els: Beyond the density lower bound assumptions. InForty-first International Conference on Machine Learning,

work page arXiv

[20] [20]

arXiv preprint arXiv:2311.01223 , year=

Zhengbang Zhu, Hanye Zhao, Haoran He, Yichao Zhong, Shenyu Zhang, Haoquan Guo, Tingting Chen, and Weinan Zhang. Diffusion models for reinforcement learning: A survey.arXiv preprint arXiv:2311.01223,

work page arXiv

[21] [21]

16 Published in Transactions on Machine Learning Research (04/2026) A Note about (Gupta et al.,

work page 2026

[22] [22]

B Score Estimation Algorithm In this section, we provide a detailed description of the algorithm used for estimating the score function in diffusion models

We note that the comparison in this paper uses the results from the updated arXiv version of their work (in Nov 2025), which incorporates a correction to an error identified in the earlier version of our work (available at arXiv). B Score Estimation Algorithm In this section, we provide a detailed description of the algorithm used for estimating the score...

work page 2025

[23] [23]

The corresponding empirical loss is defined as: ˆLk(θ) =1 n n∑ i=1 ∥sθ(xi,tk)−∇logptk(xi)∥2.(49) 17 Published in Transactions on Machine Learning Research (04/2026) Letθa k andθb k be the minimizers ofLk(θ)and ˆLt(θ), respectively, corresponding to score functionssa tk and sb tk. By the definitions of minimizers, we can write Lk(θb k)−Lk(θa k)≤Lk(θb k)−Lk...

work page 2026

[24] [24]

We can apply Lemma E.5 to bound the empirical Rademacher complexity ˆR(θ)in terms of the Rademacher complexityR(θ)of the function classΘ′′

we have with probability at least1−δ, ⏐⏐⏐L′ k(θ)−ˆL′k(θ) ⏐⏐⏐≤ˆR(θ) +O   √ log 1 δ n  ,∀θ∈Θ′′.(57) SinceΘ ′′={θa,θb}is a finite class (just two functions). We can apply Lemma E.5 to bound the empirical Rademacher complexity ˆR(θ)in terms of the Rademacher complexityR(θ)of the function classΘ′′. Since ˆR(θ) =1 m Eσ [ maxθ∈Θ′′ ∑n i=1f(θ)σi ] , applying L...

work page 2026

[25] [25]

21 Published in Transactions on Machine Learning Research (04/2026) We get Equation (86) from Equation (85) from Assumption 1, which implies that the second moment ofz is bounded

We get Equation (86) from Equation (85) from Assumption 1 and by using the upper bound on the Mill’s ration which implies thatϕ(κ) 1−Φ(κ)≤κ+1 κ. 21 Published in Transactions on Machine Learning Research (04/2026) We get Equation (86) from Equation (85) from Assumption 1, which implies that the second moment ofz is bounded. d∑ j=1 Exj∼(utk)j   |x|2 0...

work page 2026

[26] [26]

23 Published in Transactions on Machine Learning Research (04/2026) Combining the last three displays, E[L(θi+1)|θi]≤L(θi)−η∥∇L(θi)∥2 2 + Lη2 2 ( ∥∇L(θi)∥2 2 + σ2 bi )

Moreover, E [ ∥gi∥2 2|θi ] =E [ ∥∇L(θi) + (gi−∇L(θi))∥2 2|θi ] =∥∇L(θi)∥2 2 +E [ ∥gi−∇L(θi)∥2 2|θi ] ≤∥∇L(θi)∥2 2 + σ2 bi . 23 Published in Transactions on Machine Learning Research (04/2026) Combining the last three displays, E[L(θi+1)|θi]≤L(θi)−η∥∇L(θi)∥2 2 + Lη2 2 ( ∥∇L(θi)∥2 2 + σ2 bi ) . Rearranging and usingη≤1/L(so thatη−Lη2 2 ≥η/2), E[L(θi+1)|θi]≤...

work page 2026

[27] [27]

Substituting this into (108) gives E [ L(θT )−L⋆] ≤(1−ηµ)T−1∆ 1 +O ( 1√n )

2 ≥βT2 2 , 24 Published in Transactions on Machine Learning Research (04/2026) hence 1 T−1≤2 T ≤2 √ 2 βn=O ( 1√n ) . Substituting this into (108) gives E [ L(θT )−L⋆] ≤(1−ηµ)T−1∆ 1 +O ( 1√n ) . In particular, for sufficiently largen(so the transient term is negligible), E [ L(θT )−L⋆] =O ( 1√n ) . Note thatˆstk and ˆθk denote our estimate of the loss func...

work page 2026

[28] [28]

Proof.Leth 0 =x, and forℓ= 0,...,L−1define the layer recursion hℓ+1=σ(Wℓhℓ+bℓ), whereW ℓ∈Rnℓ+1×nℓ,b ℓ∈Rnℓ+1, andnℓ≤Wfor hidden layers

We get the second inequality by denotingLas the number of layers in the neural network,WandBa constant such all parameters of the neural network upper bounded byB. Proof.Leth 0 =x, and forℓ= 0,...,L−1define the layer recursion hℓ+1=σ(Wℓhℓ+bℓ), whereW ℓ∈Rnℓ+1×nℓ,b ℓ∈Rnℓ+1, andnℓ≤Wfor hidden layers. We work with theℓ∞operator norm: ∥Wℓ∥∞= max i ∑ j |(Wℓ)ij|...

work page 2026