pith. sign in

arxiv: 2606.06179 · v2 · pith:BOK7LEZBnew · submitted 2026-06-04 · 📊 stat.ML · cs.LG

Diffusion Models Observe Only Gradients: A Geometric Perspective on Score Matching Errors

Pith reviewed 2026-06-30 10:57 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords diffusion modelsscore matchingHelmholtz-Hodge decompositionFokker-Planck equationgenerative modelingdistributional divergencescore error geometry
0
0 comments X

The pith

A diffusion model can perfectly match the target distribution while incurring arbitrarily large L² score error because only the gradient part of the error affects the marginal dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that minimizing the L² score matching error is not guaranteed to produce good marginal distributions in diffusion models. A model can match the target distribution exactly even when the full L² error grows without bound. The reason is a geometric split of the score error vector field: its gradient component drives the Fokker-Planck evolution of the probability density, while the solenoidal component has no effect on it. This decomposition explains why standard bounds that use the full L² error are loose and why a new estimator focused on the gradient component tracks sample quality more closely.

Core claim

Applying the Helmholtz-Hodge decomposition to the difference between the learned and true score functions isolates a gradient component that alone enters the marginal Fokker-Planck equation; the orthogonal solenoidal component is invisible to the evolution of the probability measure. Consequently no monotone function of the L² score error can serve as a uniform lower bound on any divergence between the learned and target distributions, while an upper bound on the KL divergence can be stated using only the gradient component.

What carries the argument

Helmholtz-Hodge decomposition of the score error vector field, which isolates the gradient component that governs marginal Fokker-Planck dynamics from the solenoidal component that does not.

If this is right

  • No monotone function of the L² score error can uniformly lower-bound any divergence between learned and target distributions.
  • The KL divergence between learned and target distributions admits an upper bound that depends only on the gradient component of the score error.
  • A dual Sobolev identity yields a tractable estimator of the gradient component that correlates more strongly with sample quality than the full L² error.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives could be redesigned to penalize only the gradient component rather than the full L² norm.
  • The same decomposition may apply to other generative models that rely on estimated vector fields or scores.
  • Monitoring the size of the solenoidal component during training could diagnose unnecessary model capacity without harming distributional accuracy.

Load-bearing premise

The Helmholtz-Hodge decomposition of the score error correctly identifies a gradient component that alone controls the marginal probability evolution.

What would settle it

Train a score network whose error vector field has a large solenoidal component yet produces samples whose empirical distribution converges to the target; if the samples fail to match whenever the full L² error is large, the claim is falsified.

Figures

Figures reproduced from arXiv: 2606.06179 by Na\"il B. Khelifa, Ramji Venkataramanan, Richard E. Turner.

Figure 1
Figure 1. Figure 1: Helmholtz-Hodge decomposition of score errors. Two score estimates, both with total L 2 score matching error approximately equal to 51, manifest in qualitatively different generated distributions depending on their geometric structure. Red crosses represent true mode locations and blue centroids represent learned modes. Center: with purely solenoidal error fields (green arrows), the generated distribution … view at source ↗
Figure 2
Figure 2. Figure 2: Spearman correlation between feature-FID [27] and score-error norms (Fashion￾MNIST [71]). Mean ± standard-deviation across five seeds. Across all capacities, the gradient component Egrad := Es,xs [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: DSM does not reduce the invisible component of the score error on CIFAR-10 [38]. Left. The gradient (observable) component of score estimation errors falls by roughly two orders of magnitude during training, while the solenoidal (invisible) component rises early and then plateaus, eventually accounting for nearly all of the residual error. Right. As a consequence, the standard Girsanov bound (15) saturates… view at source ↗
Figure 4
Figure 4. Figure 4: The Girsanov bound saturates while ours keeps decreasing, across all model capacities. KL upper-bound estimates on Fashion-MNIST for tiny-sized (top left) small-sized (top right) and normal-sized (bottom) score networks. The Girsanov bound (15) stays effectively constant at the solenoidal floor, whereas our bound from Theorem 3.2 tracks the gradient component and decreases by more than three orders of magn… view at source ↗
Figure 4
Figure 4. Figure 4: The Girsanov bound saturates while ours keeps decreasing, across all model capacities. KL upper-bound estimates on Fashion-MNIST for tiny-sized (top left) small-sized (top right) and normal-sized (bottom) score networks. The Girsanov bound (15) stays effectively constant at the solenoidal floor, whereas our bound from Theorem 3.2 tracks the gradient component and decreases by more than three orders of magn… view at source ↗
Figure 5
Figure 5. Figure 5: DSM does not reduce the invisible component of the score error, across all model capacities. Decomposition of the score estimation error es into its observable gradient component ΠGs es and its invisible solenoidal component ΠG⊥s es on Fashion-MNIST, for tiny-sized (top left) small-sized (top right) and normal-sized (bottom) score networks. Across all capacities, the gradient component falls by more than t… view at source ↗
Figure 5
Figure 5. Figure 5: DSM does not reduce the invisible component of the score error, across all model capacities. Decomposition of the score estimation error es into its observable gradient component ΠGs es and its invisible solenoidal component ΠG⊥s es on Fashion-MNIST, for tiny-sized (top left) small-sized (top right) and normal-sized (bottom) score networks. Across all capacities, the gradient component falls by more than t… view at source ↗
read the original abstract

Score-based diffusion models are typically trained by minimizing the $L^2$ score matching error, and standard theoretical analyses rely on this quantity to bound the sampling discrepancy between the learned and target distributions. We show the $L^2$ score error is not the right intrinsic measure of marginal distributional quality: a learned diffusion model can incur arbitrarily large $L^2$ score error while perfectly matching the target distribution. By decomposing score errors into a gradient and a solenoidal component (a Helmholtz-Hodge decomposition), we identify the geometric reason behind this: only the gradient component enters the marginal Fokker-Planck dynamics, while the solenoidal component is structurally invisible. We make this precise in three results. First, building on the corrected geometry, we prove an impossibility result: no monotone function of the $L^2$ score error can uniformly lower bound any divergence between the learned and target distributions. Second, we derive an upper bound on the Kullback-Leibler divergence that depends only on the observable gradient component of the error, tightening the standard Girsanov bound for generic score networks, and identifying its looseness as the cost of operating on path-space rather than marginal-space dynamics. Third, we give a tractable estimator of the gradient component via a dual Sobolev identity, which is shown to empirically correlate substantially better with sample quality than the full $L^2$ error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that the L² score-matching objective is not the right intrinsic measure of marginal quality for learned diffusion models: via a Helmholtz-Hodge decomposition of the score error into gradient and solenoidal components, only the gradient part enters the marginal Fokker-Planck dynamics while the solenoidal part is invisible to them. This geometry is used to prove (i) an impossibility result that no monotone function of the L² error can uniformly lower-bound any divergence between learned and target marginals, (ii) a tightened KL upper bound that depends only on the observable gradient component (improving the standard Girsanov bound), and (iii) a tractable estimator of the gradient component via a dual Sobolev identity that empirically correlates better with sample quality than the full L² error.

Significance. If the decomposition is valid under the paper's conditions, the work supplies a geometric explanation for the frequent mismatch between L² score error and generative performance, supplies a strictly tighter marginal-space bound, and introduces a practical estimator whose empirical superiority is a concrete strength. The dual-Sobolev construction for the estimator is a technically clean contribution.

major comments (2)
  1. [Section stating the decomposition and the three results] The Helmholtz-Hodge decomposition of the score error e = s_θ − ∇log p is applied on R^d (abstract and the section stating the three results) without any stated decay or integrability conditions at infinity. On unbounded domains the decomposition e = ∇ϕ + ∇×A is unique and the boundary terms vanish in ∫ div(p e) only under assumptions such as |e(x)| = o(1/|x|) or e belonging to a suitable weighted L² space; absent these, the solenoidal component can contribute to the marginal evolution and the impossibility result plus the tightened KL bound both fail.
  2. [Proof of the impossibility result] The impossibility result (that no monotone function of the L² error uniformly lower-bounds divergences) is load-bearing on the claim that the solenoidal component can be made arbitrarily large while leaving the marginal Fokker-Planck equation unchanged. The construction must explicitly verify that the chosen vector field satisfies the decay conditions needed for the decomposition to isolate a purely invisible solenoidal part; otherwise the counter-example is not rigorous.
minor comments (1)
  1. [Empirical evaluation of the estimator] The empirical section reports that the gradient-component estimator correlates substantially better with sample quality, but does not state the precise datasets, number of runs, or statistical tests used; adding these details would strengthen reproducibility without affecting the central claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and the precise identification of the technical conditions required for the Helmholtz-Hodge decomposition on unbounded domains. We address both major comments below and will strengthen the manuscript accordingly.

read point-by-point responses
  1. Referee: The Helmholtz-Hodge decomposition of the score error e = s_θ − ∇log p is applied on R^d (abstract and the section stating the three results) without any stated decay or integrability conditions at infinity. On unbounded domains the decomposition e = ∇ϕ + ∇×A is unique and the boundary terms vanish in ∫ div(p e) only under assumptions such as |e(x)| = o(1/|x|) or e belonging to a suitable weighted L² space; absent these, the solenoidal component can contribute to the marginal evolution and the impossibility result plus the tightened KL bound both fail.

    Authors: We agree that explicit decay and integrability conditions are necessary for the uniqueness of the decomposition and the vanishing of boundary terms on R^d. In the revised manuscript we will add a new subsection (placed immediately before the three main results) that states the precise assumptions: the score error e belongs to a weighted L² space with weight (1+|x|)^{d+ε} for some ε>0, which implies |e(x)|=o(1/|x|) at infinity. Under these conditions the solenoidal component is divergence-free with respect to the weighted measure and does not enter the Fokker-Planck equation. We will also verify that all densities and score networks considered in the paper (smooth, sub-Gaussian tails) satisfy the required integrability, so the three results remain valid. These additions make the geometric argument fully rigorous. revision: yes

  2. Referee: The impossibility result (that no monotone function of the L² error uniformly lower-bounds divergences) is load-bearing on the claim that the solenoidal component can be made arbitrarily large while leaving the marginal Fokker-Planck equation unchanged. The construction must explicitly verify that the chosen vector field satisfies the decay conditions needed for the decomposition to isolate a purely invisible solenoidal part; otherwise the counter-example is not rigorous.

    Authors: We accept that the counter-example construction must be accompanied by an explicit verification of the decay conditions. In the revision we will replace the current sketch with a fully detailed construction: we take a compactly supported, divergence-free vector field A (hence e=∇×A) that is C^∞ and identically zero outside a large ball. Compact support immediately satisfies |e(x)|=o(1/|x|) and all weighted integrability requirements. We then show that the corresponding score perturbation leaves the marginal Fokker-Planck equation invariant while making the L² norm arbitrarily large. The revised proof will cite the new subsection on decay conditions to confirm that the decomposition isolates a purely solenoidal, invisible component. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external mathematical identities

full rationale

The paper's central results—the impossibility theorem, the tightened KL bound, and the gradient-component estimator—are obtained by applying the standard Helmholtz-Hodge decomposition and a dual Sobolev identity to the score error vector field. These are external, well-known facts from vector calculus and functional analysis; the derivations do not reduce any claimed prediction or bound to a fitted parameter, a self-citation chain, or a quantity defined in terms of the target result. No load-bearing step is shown to be equivalent to its inputs by construction, and the manuscript does not invoke uniqueness theorems or ansatzes traceable only to the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the standard mathematical Helmholtz-Hodge decomposition theorem and the Fokker-Planck equation for marginal evolution; no free parameters or new entities are introduced.

axioms (1)
  • standard math Helmholtz-Hodge decomposition applies to score error vector fields and separates components with distinct effects on marginal dynamics
    Invoked to explain why only the gradient component is observable in the marginal distribution.

pith-pipeline@v0.9.1-grok · 5791 in / 1330 out tokens · 69922 ms · 2026-06-30T10:57:29.744452+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

78 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    Lectures in Mathematics ETH Zürich

    Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré.Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics ETH Zürich. Birkhäuser, 2008

  2. [2]

    Reverse-time diffusion equation models.Stochastic Processes and their Applications, 12(3):313–326, 1982

    Brian DO Anderson. Reverse-time diffusion equation models.Stochastic Processes and their Applications, 12(3):313–326, 1982

  3. [3]

    Donald G. Aronson. Bounds for the fundamental solution of a parabolic equation.Bulletin of the American Mathematical Society, 73(6):890–896, 1967

  4. [4]

    Donald G. Aronson. Non-negative solutions of linear parabolic equations.Annali della Scuola Normale Superiore di Pisa, 22(4):607–694, 1968

  5. [5]

    A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem.Numerische Mathematik, 84:375–393, 2000

    Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem.Numerische Mathematik, 84:375–393, 2000

  6. [6]

    V . E. Beneš. Existence of optimal stochastic control laws.SIAM Journal on Control, 9(3):446– 472, 1971

  7. [7]

    Nearly d-linear convergence bounds for diffusion models via stochastic localization

    Joe Benton, Valentin De Bortoli, Arnaud Doucet, and George Deligiannidis. Nearly d-linear convergence bounds for diffusion models via stochastic localization. InThe Twelfth International Conference on Learning Representations, 2024

  8. [8]

    The Helmholtz- Hodge decomposition—a survey.IEEE Transactions on Visualization and Computer Graphics, 19(8):1386–1404, 2013

    Harsh Bhatia, Gregory Norgard, Valerio Pascucci, and Peer-Timo Bremer. The Helmholtz- Hodge decomposition—a survey.IEEE Transactions on Visualization and Computer Graphics, 19(8):1386–1404, 2013

  9. [9]

    Bogachev, Nicolai V

    Vladimir I. Bogachev, Nicolai V . Krylov, Michael Röckner, and Stanislav V . Shaposhnikov. Fokker–Planck–Kolmogorov Equations, volume 207 ofMathematical Surveys and Monographs. American Mathematical Society, 2015

  10. [10]

    Convergence of denoising diffusion models under the manifold hypothesis

    Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022

  11. [11]

    Springer, 2010

    Haim Brezis.Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer, 2010

  12. [12]

    R. H. Cameron and W. T. Martin. Transformations of weiner integrals under translations.Annals of Mathematics, 45(2):386–396, 1944

  13. [13]

    Time reversal of diffusion processes under a finite entropy condition.Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 59(4):1844 – 1881, 2023

    Patrick Cattiaux, Giovanni Conforti, Ivan Gentil, and Christian Léonard. Time reversal of diffusion processes under a finite entropy condition.Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 59(4):1844 – 1881, 2023

  14. [14]

    On investigating the conservative property of score-based generative models

    Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, and Chun-Yi Lee. On investigating the conservative property of score-based generative models. InInternational Conference on Machine Learning (ICML), 2023

  15. [15]

    Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions

    Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative modeling: User-friendly bounds under minimal smoothness assumptions. InProceedings of the 40th International Conference on Machine Learning, volume 202, 2023. 11

  16. [16]

    Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data.40th International Conference on Machine Learning, ICML, 2023

    Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data.40th International Conference on Machine Learning, ICML, 2023

  17. [17]

    Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions

    Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. International Conference on Learning Representations, 2023

  18. [18]

    Springer, 2025

    Sinho Chewi, Jonathan Niles-Weed, and Philippe Rigollet.Statistical optimal transport. Springer, 2025

  19. [19]

    Perception prioritized training of diffusion models

    Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized training of diffusion models. InInternational Conference on Computer Vision, 2022

  20. [20]

    Doléans-Dade

    C. Doléans-Dade. Quelques applications de la formule de changement de variables pour les semimartingales.Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 16(3):181– 194, 1970

  21. [21]

    Zehao Dou, Subhodh Kotekal, Zhehao Xu, and Harrison H. Zhou. From optimal score matching to optimal sampling, 2024. arXiv:2409.07032

  22. [22]

    Springer Nature, 2021

    Weinan E, Jiequn Han, and Arnulf Jentzen.Algorithms for Solving High Dimensional PDEs: From Nonlinear Monte Carlo to Machine Learning. Springer Nature, 2021

  23. [23]

    H. Föllmer. An entropy approach to the time reversal of diffusion processes. InStochastic Differential Systems Filtering and Control. Springer Berlin Heidelberg, 1985

  24. [24]

    Divergence-free diffusion models for incompressible fluid flows, 2026

    Wilfried Genuist, Éric Savin, Filippo Gatti, and Didier Clouteau. Divergence-free diffusion models for incompressible fluid flows, 2026

  25. [25]

    On transforming a certain class of stochastic processes by absolutely con- tinuous substitution of measures.Theory of Probability & Its Applications, 5(3):285–301, 1960

    Igor V Girsanov. On transforming a certain class of stochastic processes by absolutely con- tinuous substitution of measures.Theory of Probability & Its Applications, 5(3):285–301, 1960

  26. [26]

    Solving high-dimensional partial differential equations using deep learning.Proceedings of the National Academy of Sciences, 115(34):8505– 8510, 2018

    Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning.Proceedings of the National Academy of Sciences, 115(34):8505– 8510, 2018

  27. [27]

    Maddix, Shima Alizadeh, Gaurav Gupta, and Michael W

    Derek Hansen, Danielle C. Maddix, Shima Alizadeh, Gaurav Gupta, and Michael W. Mahoney. Learning physical models that can respect conservation laws. InProceedings of the 40th International Conference on Machine Learning (ICML), 2023

  28. [28]

    U. G. Haussmann and E. Pardoux. Time Reversal of Diffusions.The Annals of Probability, 14(4), 1986

  29. [29]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017

  30. [30]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

  31. [31]

    Multilayer feedforward networks are universal approximators.Neural Networks, 2, 1989

    Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators.Neural Networks, 2, 1989

  32. [32]

    On gauge freedom, conservativity and intrinsic dimensionality estimation in diffusion models

    Christian Horvat and Jean-Pascal Pfister. On gauge freedom, conservativity and intrinsic dimensionality estimation in diffusion models. InThe Twelfth International Conference on Learning Representations, 2024

  33. [33]

    Estimation of non-normalized statistical models by score matching.Journal of Machine Learning Research, 6:695–709, 2005

    Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching.Journal of Machine Learning Research, 6:695–709, 2005. 12

  34. [34]

    Some extensions of score matching

    Aapo Hyvärinen. Some extensions of score matching. InComputational Statistics & Data Analysis, pages 2499–2512. Elsevier, 2007

  35. [35]

    The variational formulation of the fokker– planck equation.SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998

    Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker– planck equation.SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998

  36. [36]

    Springer, 2014

    Ioannis Karatzas and Steven Shreve.Brownian motion and stochastic calculus. Springer, 2014

  37. [37]

    Quantifying Error Propagation and Model Collapse in Diffusion Models

    Nail B. Khelifa, Richard E. Turner, and Ramji Venkataramanan. Quantifying error propagation and model collapse in diffusion models, 2026. arXiv:2602.16601

  38. [38]

    Understanding diffusion objectives as the ELBO with simple data augmentation

    Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. InAdvances in Neural Information Processing Systems, 2023

  39. [39]

    Variational diffusion models

    Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in neural information processing systems, 2021

  40. [40]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.CoRR, 2014

  41. [41]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, 2009

  42. [42]

    Springer Publishing Company, Incorporated, 2018

    Jean-Francois Le Gall.Brownian Motion, Martingales, and Stochastic Calculus. Springer Publishing Company, Incorporated, 2018

  43. [43]

    Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11), 1998

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11), 1998

  44. [44]

    A tutorial on energy-based learning.Predicting Structured Data, 1(0), 2006

    Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fu Jie Huang. A tutorial on energy-based learning.Predicting Structured Data, 1(0), 2006

  45. [45]

    Beta Sampling is All You Need: Efficient Image Generation Strategy for Diffusion Models Using Stepwise Spectral Analysis

    Haeil Lee, Hansang Lee, Seoyeon Gye, and Junmo Kim. Beta Sampling is All You Need: Efficient Image Generation Strategy for Diffusion Models Using Stepwise Spectral Analysis. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

  46. [46]

    Convergence for score-based generative modeling with polynomial complexity

    Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. InProceedings of the 36th International Conference on Neural Information Processing Systems, 2022

  47. [47]

    Lewis, Oliver Y

    Rebecca M. Lewis, Oliver Y . Feng, Henry W. J. Reeve, Min Xu, and Richard J. Samworth. Learning the score under shape constraints, 2025. arXiv:2512.14624

  48. [48]

    Project and generate: Divergence-free neural operators for incompress- ible flows, 2026

    Xigui Li, Hongwei Zhang, Ruoxi Jiang, Deshu Chen, Chensen Lin, Limei Han, Yuan Qi, Xin Guo, and Yuan Cheng. Project and generate: Divergence-free neural operators for incompress- ible flows, 2026. arXiv:2603.24500

  49. [49]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations (ICLR), 2019

  50. [50]

    The expressive power of neural networks: A view from the width

    Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. InAdvances in neural information processing systems, 2017

  51. [51]

    A. A. Novikov. On an identity for stochastic integrals.Theory of Probability & Its Applications, 17(4):717–720, 1973

  52. [52]

    Diffusion models are minimax optimal distribution estimators.International Conference on Machine Learning, 2023

    Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators.International Conference on Machine Learning, 2023

  53. [53]

    The geometry of dissipative evolution equations: The porous medium equation

    Felix Otto. The geometry of dissipative evolution equations: The porous medium equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001

  54. [54]

    Pavliotis.Stochastic Processes and Applications: Diffusion Processes, the Fokker- Planck and Langevin Equations

    Grigorios A. Pavliotis.Stochastic Processes and Applications: Diffusion Processes, the Fokker- Planck and Langevin Equations. Springer, New York, NY , 2014. 13

  55. [55]

    Peebles and Saining Xie

    William S. Peebles and Saining Xie. Scalable diffusion models with transformers.2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2022

  56. [56]

    Score-based generative models detect manifolds

    Jakiw Pidstrigach. Score-based generative models detect manifolds. InProceedings of the 36th International Conference on Neural Information Processing Systems, 2022

  57. [57]

    Jack Richter-Powell, Yaron Lipman, and Ricky T. Q. Chen. Neural conservation laws: A divergence-free perspective. InAdvances in Neural Information Processing Systems, 2022

  58. [58]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InInternational Conference on Medical image computing and computer-assisted intervention, 2015

  59. [59]

    Should EBMs model the energy or the score? InEnergy Based Models Workshop - ICLR 2021, 2021

    Tim Salimans and Jonathan Ho. Should EBMs model the energy or the score? InEnergy Based Models Workshop - ICLR 2021, 2021

  60. [60]

    Optimal Transport for Applied Mathematicians

    Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Calculus of Variations, PDEs and Modeling.Progress in Nonlinear Differential Equations and Their Applications, 1(87), 2015

  61. [61]

    On approximating∇fwith neural networks, 2019

    Saeed Saremi. On approximating∇fwith neural networks, 2019. arXiv:1910.12744

  62. [62]

    Deep unsuper- vised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InProceedings of the 32nd International Conference on Machine Learning, 2015

  63. [63]

    Maximum likelihood training of score-based diffusion models

    Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. InProceedings of the 35th International Conference on Neural Information Processing Systems, 2021

  64. [64]

    Generative modeling by estimating gradients of the data distribution.Advances in Neural Information Processing Systems, 32, 2019

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.Advances in Neural Information Processing Systems, 32, 2019

  65. [65]

    Sliced score matching: A scalable approach to density and score estimation

    Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. InProceedings of The 35th Uncertainty in Artificial Intelligence Conference, Proceedings of Machine Learning Research, 2020

  66. [66]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  67. [67]

    Stroock and S

    Daniel W. Stroock and S. R. Srinivasa Varadhan.Multidimensional Diffusion Processes, volume 233 ofGrundlehren der mathematischen Wissenschaften. Springer, 1979

  68. [68]

    Efficient diffusion training via min-snr weighting strategy

    Hang Tiankai, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. InInternational Conference on Computer Vision, 2023

  69. [69]

    Learning flow distributions via projection- constrained diffusion on manifolds, 2026

    Noah Trupin, Rahul Ghosh, and Aadi Jangid. Learning flow distributions via projection- constrained diffusion on manifolds, 2026. arXiv:2602.17773

  70. [70]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, 2017

  71. [71]

    Springer, 2009

    Cédric Villani.Optimal transport: old and new, volume 338. Springer, 2009

  72. [72]

    A connection between score matching and denoising autoencoders.Neural Computation, 23(7):1661–1674, 2011

    Pascal Vincent. A connection between score matching and denoising autoencoders.Neural Computation, 23(7):1661–1674, 2011

  73. [73]

    Santos, and Yen Ting Lin

    An Vuong, Michael Thompson McCann, Javier E. Santos, and Yen Ting Lin. Are we really learning the score function? Reinterpreting diffusion models through Wasserstein gradient flow matching.Transactions on Machine Learning Research, 2025. 14

  74. [74]

    An analytical theory of spectral bias in the learning dynamics of diffusion models

    Binxu Wang and Cengiz Pehlevan. An analytical theory of spectral bias in the learning dynamics of diffusion models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  75. [75]

    Optimal score estimation via empirical Bayes smoothing

    Andre Wibisono, Yihong Wu, and Kaylee Yingxi Yang. Optimal score estimation via empirical Bayes smoothing. InThe Thirty Seventh Annual Conference on Learning Theory, pages 4958–4991, 2024

  76. [76]

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Han Xiao, Kashif Rasul, and Roland V ollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017. arXiv:1708.07747

  77. [77]

    Yin, Feng Liang, and Jingbo Liu

    Kaihong Zhang, Caitlyn H. Yin, Feng Liang, and Jingbo Liu. Minimax optimality of score-based diffusion models: beyond the density lower bound assumptions. InProceedings of the 41st International Conference on Machine Learning, 2024

  78. [78]

    Z T t0 σs es ·d ¯BP⋆ s # + 1 2 EP⋆

    Pengze Zhang, Hubery Yin, Chen Li, and Xiaohua Xie. Tackling the singularities at the endpoints of time intervals in diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 15 A Proofs of Main Results A.1 Proof of Theorem 3.1 Proof. Proof of (i): Recall that the score estimation error fielde s is...