arxiv: 2605.20235 · v1 · pith:SK4UNTNInew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine

Pith reviewed 2026-05-21 07:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords diffusion models · manifold hypothesis · score matching · dimensional collapse · latent diffusion · sample complexity · generative modeling

The pith

Diffusion models learn manifold-supported data via score-driven collapse and refinement, making sample complexity depend on intrinsic dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that diffusion models efficiently learn the score function for data supported on low-dimensional manifolds by exploiting a collapse-and-refine process rooted in the score's geometry. At small noise scales the score's diverging singularity forces the denoising map to collapse onto the manifold projection; at moderate scales the same training objective then refines the density on that manifold. This single denoising score matching loss yields both manifold learning and density estimation, removing the need for separate KL regularization used in VAE-based latent diffusion models. The resulting guarantee is that required training samples scale with the manifold's intrinsic dimension rather than the ambient space dimension, explaining how these models avoid the curse of dimensionality on image and molecular data.

Core claim

The geometry of the score function itself produces a collapse-and-refine mechanism: at small noise scales its diverging singularity drives rapid dimensional collapse of the induced denoising map onto the data manifold projection, while at moderate noise scales training refines the intrinsic density on the learned manifold. This principle is realized as Score-induced Latent Diffusion (SiLD), a two-stage framework in which both manifold learning and density estimation emerge from one denoising score matching objective, and it is proved that the sample complexity depends on the intrinsic dimension rather than the ambient dimension.
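
To make the small-noise collapse concrete, the following worked equations sketch the standard relationship between the score, the denoising map, and the manifold projection. The additive noising $x_\sigma = x_0 + \sigma\varepsilon$, the symbols $\mathcal{M}$, $\pi$, $D_\sigma$, and the weighting $\lambda(\sigma)$ are assumptions made for illustration; the paper's own parameterization and regularity conditions are not reproduced here and may differ.

```latex
% Assumed setup (not taken from the paper): x_0 ~ p_0 supported on a manifold M in R^D,
% x_sigma = x_0 + sigma * eps with eps ~ N(0, I_D), and p_sigma the law of x_sigma.
\begin{align}
  \mathcal{L}_{\mathrm{DSM}}(\theta)
    &= \mathbb{E}_{\sigma}\,\lambda(\sigma)\,
       \mathbb{E}_{x_0,\varepsilon}
       \Bigl\| s_\theta(x_0 + \sigma\varepsilon,\, \sigma) + \frac{\varepsilon}{\sigma} \Bigr\|^2
       && \text{(denoising score matching)} \\
  D_\sigma(x)
    &:= x + \sigma^2\, \nabla_x \log p_\sigma(x)
     = \mathbb{E}\bigl[x_0 \mid x_\sigma = x\bigr]
       && \text{(Tweedie's formula)} \\
  D_\sigma(x)
    &\xrightarrow[\;\sigma \to 0\;]{} \pi(x)
       && \text{(collapse; } \pi(x) \text{ the unique nearest point of } \mathcal{M}\text{)} \\
  \nabla_x \log p_\sigma(x)
    &= \frac{D_\sigma(x) - x}{\sigma^2}
     \approx \frac{\pi(x) - x}{\sigma^2}
       && \text{(diverges like } \sigma^{-2} \text{ for } x \notin \mathcal{M}\text{)}
\end{align}
```

Read this way, the minimizer of one objective carries the projection $\pi$ at small $\sigma$ and the posterior mean over the manifold at moderate $\sigma$, which is the sense in which collapse and refinement can share a loss. The limit in the third line is only meaningful for $x$ close enough to $\mathcal{M}$ that the nearest point is unique.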

What carries the argument

The collapse-and-refine mechanism driven by the diverging singularity of the score function at small noise scales, which forces dimensional collapse of the denoising map onto the manifold projection.

If this is right

  • Sample complexity for learning the score scales with the intrinsic dimension of the data manifold instead of ambient dimension.
  • Manifold learning and density estimation both arise from a single denoising score matching objective without heuristic KL regularization.
  • SiLD matches or exceeds the generation quality of VAE-based latent diffusion models while improving reconstruction accuracy.
  • The mechanism is validated on Stacked MNIST, CelebA variants, and molecular generation tasks.
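
As a rough illustration of what a single denoising score matching objective measures, the sketch below evaluates a Monte Carlo estimate of the loss on a toy flat "manifold": a standard Gaussian on the first `d` coordinates of R^D. The toy distribution, the additive noising `x0 + sigma * eps`, and the helper names `dsm_loss`, `make_subspace_data`, `exact_score`, and `zero_score` are all assumptions introduced here; none of this is the paper's SiLD code.

```python
import numpy as np


def dsm_loss(score_fn, x0, sigma, rng):
    """Monte Carlo estimate of E || score_fn(x0 + sigma*eps) + eps/sigma ||^2."""
    eps = rng.standard_normal(x0.shape)
    x_noisy = x0 + sigma * eps
    residual = score_fn(x_noisy) + eps / sigma
    return float(np.mean(np.sum(residual**2, axis=1)))


def make_subspace_data(n, d, D, rng):
    """Toy data (assumption): standard Gaussian on the first d coordinates of R^D."""
    x0 = np.zeros((n, D))
    x0[:, :d] = rng.standard_normal((n, d))
    return x0


def exact_score(d, sigma):
    """Exact score of the noised toy law: N(0, 1 + sigma^2) on the first d
    coordinates and N(0, sigma^2) on the remaining D - d coordinates."""
    def score(x):
        s = -x / sigma**2                      # normal directions
        s[:, :d] = -x[:, :d] / (1.0 + sigma**2)  # tangent directions
        return s
    return score


def zero_score(x):
    """Baseline that ignores the data entirely."""
    return np.zeros_like(x)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, D, sigma = 2, 20, 0.5
    x0 = make_subspace_data(20000, d, D, rng)
    print("exact score loss:", dsm_loss(exact_score(d, sigma), x0, sigma, rng))
    print("zero score loss: ", dsm_loss(zero_score, x0, sigma, rng))
```

For this toy, the exact score leaves no residual in the D − d normal directions, so its population loss is d·(1/σ² − 1/(1 + σ²)), which does not involve D, whereas the zero score pays D/σ². This is population-loss arithmetic on a flat example, not a reproduction of the paper's sample-complexity result.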

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the singularity in the score is absent or removed, dimensional collapse may fail and the method could lose its efficiency advantage in high ambient dimensions.
  • The same collapse-and-refine logic may extend to explain diffusion model performance on other structured domains such as graphs or time series.
  • Conditional versions of SiLD could inherit the intrinsic-dimension scaling for tasks like class-conditional or text-to-image generation.
  • Direct measurement of effective dimension of the denoising map across noise scales on synthetic manifolds would provide an immediate test of the predicted collapse.
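
The measurement suggested in the last bullet can be sketched on a synthetic manifold. The example below places evenly spaced points on a unit circle embedded in R^D and uses the exact posterior-mean denoiser of that empirical distribution in place of a trained network. Using the trace of the denoiser's Jacobian as the "effective dimension" is a choice made here, and the function names `circle_in_ambient`, `denoiser_jacobian`, and `effective_dimension` are hypothetical.

```python
import numpy as np


def circle_in_ambient(n, D):
    """n evenly spaced points on the unit circle, zero-padded into R^D
    (intrinsic dimension 1, ambient dimension D)."""
    theta = 2.0 * np.pi * np.arange(n) / n
    X = np.zeros((n, D))
    X[:, 0] = np.cos(theta)
    X[:, 1] = np.sin(theta)
    return X


def denoiser_jacobian(x, X, sigma):
    """Jacobian at x of the posterior-mean denoiser E[x0 | x0 + sigma*eps = x]
    for the empirical distribution on the rows of X. For Gaussian noise this
    equals the posterior covariance of x0 divided by sigma^2."""
    logits = -np.sum((X - x) ** 2, axis=1) / (2.0 * sigma**2)
    w = np.exp(logits - logits.max())  # numerically stable softmax weights
    w /= w.sum()
    mean = w @ X
    centered = X - mean
    cov = centered.T @ (w[:, None] * centered)
    return cov / sigma**2


def effective_dimension(J):
    """Trace of the Jacobian: near the tangent dimension when J is close to an
    orthogonal projector, and equal to D for the identity map."""
    return float(np.trace(J))


if __name__ == "__main__":
    X = circle_in_ambient(2000, 10)
    x = X[0]  # query point on the manifold
    for sigma in (5.0, 0.5, 0.05):
        J = denoiser_jacobian(x, X, sigma)
        print(f"sigma={sigma}: trace of Jacobian = {effective_dimension(J):.3f}")
```

On this toy, the trace should sit near the intrinsic dimension 1 at small noise and shrink toward 0 at large noise; a denoiser that failed to collapse and stayed close to the identity map would instead show a trace near the ambient dimension D. With a finite sample, σ has to stay above the spacing between training points, since below that the empirical denoiser snaps to individual points and the trace drops to 0 for a different reason.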

Load-bearing premise

The data distribution is supported on a low-dimensional manifold and the score function exhibits a diverging singularity at small noise scales that induces dimensional collapse of the denoising map onto the manifold projection.

What would settle it

An experiment showing that the effective dimension of the learned denoising map remains close to ambient dimension at small noise scales, or that empirical sample complexity scales with ambient rather than intrinsic dimension on controlled low-intrinsic-dimensional data.

Figures

Figures reproduced from arXiv: 2605.20235 by Andi Han, Huanjian Zhou, Kenji Fukumizu, Mingyuan Bai, Qixin Zhang, Taiji Suzuki, Wei Huang.

Figure 1: Two-stage learning dynamics on the Mixture of Gaussian on manifold experiment.
Figure 2: Uncurated samples from Stacked MNIST. Each image is three random MNIST digits …
Figure 3: Denoising and reconstruction on CelebA ( …
Original abstract

Diffusion models generate high-dimensional data with remarkable quality, yet how their training efficiently learns the score function, bypassing the curse of dimensionality when data is supported on low-dimensional manifolds, remains theoretically unexplained. We identify a collapse-and-refine mechanism driven by the geometry of the score function itself: at small noise scales, the diverging singularity of the score drives a rapid dimensional collapse of the induced denoising map onto the data manifold projection; at moderate noise scales, training refines the intrinsic density on the learned manifold. We instantiate this principle as Score-induced Latent Diffusion (SiLD), a two-stage framework in which both manifold learning and density estimation emerge from a single denoising score matching objective, replacing the heuristic KL regularization of VAE-based latent diffusion models. We prove that the resulting sample complexity depends on the intrinsic dimension rather than the ambient dimension. Experiments on Stacked MNIST, CelebA variants, and molecular generation benchmarks show that SiLD matches or outperforms VAE-based LDMs in generation quality and consistently improves reconstruction, validating our theoretical predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies a collapse-and-refine mechanism in diffusion models under the manifold hypothesis: at small noise scales the diverging singularity of the score function drives dimensional collapse of the denoising map onto the manifold projection, while at moderate scales training refines the intrinsic density. This principle is instantiated as Score-induced Latent Diffusion (SiLD), a two-stage framework derived from a single denoising score matching objective that replaces heuristic KL regularization in VAE-based latent diffusion models. The authors prove that the resulting sample complexity depends on the intrinsic dimension rather than the ambient dimension, and report experiments on Stacked MNIST, CelebA variants, and molecular generation benchmarks showing that SiLD matches or outperforms VAE-based LDMs in generation quality while improving reconstruction.

Significance. If the central proof holds, the work supplies a geometric explanation for why diffusion models evade the curse of dimensionality on manifold-supported data and grounds the manifold hypothesis directly in score-function geometry. The SiLD construction is notable for deriving both manifold learning and density estimation from one objective without additional regularization terms or free parameters. The experimental results are presented as direct validation of the theoretical predictions, and the parameter-free character of the sample-complexity claim is a clear strength.

major comments (2)
  1. [Proof of sample-complexity result] Proof of the sample-complexity claim (the section deriving the end-to-end bound from the collapse-and-refine mechanism): the argument that the singularity-induced collapse propagates through score estimation to eliminate all ambient-dimension D dependence must be made explicit. Standard denoising-score-matching analyses bound empirical risk minimization over function classes whose covering numbers or Lipschitz constants scale with D; the manuscript needs to show, via the relevant generalization or optimization bound, that no residual D factor survives once the collapse onto the manifold projection is accounted for.
  2. [SiLD framework description] Definition of the SiLD framework and its relation to the single denoising objective: it is stated that both stages emerge from one score-matching loss, yet the separation into collapse (small-noise) and refine (moderate-noise) phases appears to rely on a noise schedule whose precise form could re-introduce D-dependent estimation rates if the function class remains defined in ambient space. The manuscript should clarify whether the schedule or the function-class restriction is chosen in a way that preserves the claimed D-independence.
minor comments (2)
  1. [Notation] Notation for the intrinsic dimension d versus ambient dimension D should be introduced once at the beginning and used consistently; occasional switches between capital and lower-case D in the theoretical sections reduce readability.
  2. [Experiments] The experimental section reports generation quality metrics but does not include an ablation that isolates the contribution of the collapse stage versus the refine stage; adding such a controlled comparison would strengthen the link between theory and experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important points for clarifying the propagation of the collapse mechanism in the sample-complexity proof and the precise role of the noise schedule in preserving dimension independence. We address each major comment below and have revised the manuscript to strengthen the exposition without altering the core claims or results.

  1. Referee: [Proof of sample-complexity result] Proof of the sample-complexity claim (the section deriving the end-to-end bound from the collapse-and-refine mechanism): the argument that the singularity-induced collapse propagates through score estimation to eliminate all ambient-dimension D dependence must be made explicit. Standard denoising-score-matching analyses bound empirical risk minimization over function classes whose covering numbers or Lipschitz constants scale with D; the manuscript needs to show, via the relevant generalization or optimization bound, that no residual D factor survives once the collapse onto the manifold projection is accounted for.

    Authors: We agree that the propagation step merits a more explicit treatment. In the original manuscript, Lemma 3 establishes that the score singularity at small noise scales forces the denoising map to collapse onto the manifold projection, after which the effective function class for score estimation is supported only in a tubular neighborhood of the manifold. The end-to-end bound in Theorem 1 then invokes covering-number arguments on this restricted class, whose metric entropy scales with the intrinsic dimension d. To address the referee's concern directly, we have inserted a new paragraph immediately following the statement of Lemma 3 that explicitly traces how the collapse eliminates residual D factors in both the optimization error (via restricted Lipschitz constants) and the generalization error (via covering numbers of the projected function class). The revised proof now cites the relevant generalization bound from the score-matching literature and shows that the ambient dimension D appears only in transient terms that vanish once collapse occurs.

    revision: yes

  2. Referee: [SiLD framework description] Definition of the SiLD framework and its relation to the single denoising objective: it is stated that both stages emerge from one score-matching loss, yet the separation into collapse (small-noise) and refine (moderate-noise) phases appears to rely on a noise schedule whose precise form could re-introduce D-dependent estimation rates if the function class remains defined in ambient space. The manuscript should clarify whether the schedule or the function-class restriction is chosen in a way that preserves the claimed D-independence.

    Authors: The SiLD construction is obtained by partitioning the single denoising score-matching objective across noise scales without introducing extra regularization or parameters. The noise schedule is selected so that the small-noise regime triggers the geometric collapse proven in Lemma 3, after which the subsequent moderate-noise regime operates on the already-collapsed manifold. Because the training dynamics themselves enforce the restriction to the manifold (rather than an a-priori ambient function class), the covering numbers and Lipschitz constants in the generalization analysis remain governed by d. We have added a clarifying remark in Section 3.1 that explicitly states this point and cross-references the proof in Section 4 to confirm that no D-dependent rates are re-introduced by the schedule.

    revision: yes
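
The covering-number contrast at stake in the exchange above can be illustrated with the textbook volumetric bound. The sets $B_D$ and $K$, the constant $C_{\mathcal{M}}$, and the comparison itself are assumptions chosen for illustration; they do not reproduce Lemma 3 or Theorem 1, whose statements are not given on this page.

```latex
% Standard volumetric bound for the Euclidean unit ball B_D in R^D, 0 < eps <= 1:
\[
  \Bigl(\tfrac{1}{\epsilon}\Bigr)^{D}
  \;\le\; N\bigl(\epsilon,\, B_D,\, \|\cdot\|_2\bigr)
  \;\le\; \Bigl(1 + \tfrac{2}{\epsilon}\Bigr)^{D},
  \qquad\text{so}\qquad
  \log N\bigl(\epsilon,\, B_D,\, \|\cdot\|_2\bigr) \asymp D \log\tfrac{1}{\epsilon}
  \ \text{ for small } \epsilon .
\]
% Illustrative contrast (assumed): K a compact subset of a smooth d-dimensional
% manifold M in R^D, with eps sufficiently small relative to the geometry of M:
\[
  \log N\bigl(\epsilon,\, K,\, \|\cdot\|_2\bigr)
  \;\le\; d \log\tfrac{1}{\epsilon} + \log C_{\mathcal{M}},
\]
% where C_M depends on geometric quantities of M such as its volume and reach.
```

The referee's question is whether the function class used for score estimation inherits the second kind of bound (exponent d) or the first (exponent D); the inequalities above only show why the distinction matters for the resulting rates.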

Circularity Check

0 steps flagged

No circularity: proof derives sample complexity from geometric collapse without reducing to inputs by construction

The paper presents a theoretical derivation of sample complexity depending on intrinsic dimension via the collapse-and-refine mechanism, where score singularity at small noise induces dimensional collapse of the denoising map onto the manifold projection, followed by refinement at moderate scales. The abstract and description frame this as emerging from a single denoising score matching objective instantiated as SiLD, with the proof claimed to follow from first-principles geometric analysis rather than any fitted parameter, self-citation chain, or definitional equivalence. No load-bearing step reduces the claimed result to a tautology or renamed input; the central claim retains independent mathematical content from the manifold hypothesis and score geometry. This qualifies as a self-contained theoretical contribution with no detected circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the manifold hypothesis and on specific regularity properties of the score function at different noise scales. No free parameters are described in the abstract; the one invented entity listed below is the SiLD framework itself.

axioms (2)
  • domain assumption: Data distribution is supported on a low-dimensional manifold
    Invoked to explain why the score singularity drives dimensional collapse
  • domain assumption: Score function exhibits diverging singularity at small noise scales
    Used to derive the rapid collapse of the denoising map onto the manifold projection
invented entities (1)
  • Score-induced Latent Diffusion (SiLD) · no independent evidence
    purpose: Two-stage framework that performs manifold learning and density estimation from a single denoising score matching objective
    New training procedure replacing heuristic KL regularization of VAE-based LDMs

pith-pipeline@v0.9.0 · 5727 in / 1326 out tokens · 29730 ms · 2026-05-21T07:32:13.100711+00:00 · methodology

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 7 internal anchors

  1. [1]

    Sgd learning on neural net- works: leap complexity and saddle-to-saddle dynamics

    Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. Sgd learning on neural net- works: leap complexity and saddle-to-saddle dynamics. InThe Thirty Sixth Annual Conference on Learning Theory, pages 2552–2623. PMLR, 2023

  2. [2]

    Convergence of dif- fusion models under the manifold hypothesis in high-dimensions.arXiv preprint arXiv:2409.18804,

    Iskander Azangulov, George Deligiannidis, and Judith Rousseau. Convergence of diffusion models under the manifold hypothesis in high-dimensions.arXiv preprint arXiv:2409.18804, 2024

  3. [3]

    Nearly d-linear convergence bounds for diffu- sion models via stochastic localization.arXiv preprint arXiv:2308.03686,

    Joe Benton, Valentin De Bortoli, Arnaud Doucet, and George Deligiannidis. Nearly d- linear convergence bounds for diffusion models via stochastic localization.arXiv preprint arXiv:2308.03686, 2023. 11

  4. [4]

    Quantifying the chemical beauty of drugs.Nature chemistry, 4(2):90–98, 2012

    G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs.Nature chemistry, 4(2):90–98, 2012

  5. [5]

    Dynamical regimes of diffusion models.Nature Communications, 15(1):9957, 2024

    Giulio Biroli, Tony Bonnaire, Valentin De Bortoli, and Marc Mézard. Dynamical regimes of diffusion models.Nature Communications, 15(1):9957, 2024

  6. [6]

    Shallow diffusion networks provably learn hidden low-dimensional structure.arXiv preprint arXiv:2410.11275, 2024

    Nicholas M Boffi, Arthur Jacot, Stephen Tu, and Ingvar Ziemann. Shallow diffusion networks provably learn hidden low-dimensional structure.arXiv preprint arXiv:2410.11275, 2024

  7. [7]

    & Mézard, M.Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in TrainingarXiv:2505.17638 [cs]

    Tony Bonnaire, Raphaël Urfin, Giulio Biroli, and Marc Mézard. Why diffusion models don’t memorize: The role of implicit dynamical regularization in training.arXiv preprint arXiv:2505.17638, 2025

  8. [8]

    Generalization Properties of Score-matching Diffusion Models for Intrinsically Low-dimensional Data

    Saptarshi Chakraborty, Quentin Berthet, and Peter L Bartlett. Generalization properties of score-matching diffusion models for intrinsically low-dimensional data.arXiv preprint arXiv:2603.03700, 2026

  9. [9]

    When and how can inexact generative models still sample from the data manifold?arXiv preprint arXiv:2508.07581, 2025

    Nisha Chandramoorthy and Adriaan de Clercq. When and how can inexact generative models still sample from the data manifold?arXiv preprint arXiv:2508.07581, 2025

  10. [10]

    Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data

    Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. InInternational Conference on Machine Learning, pages 4672–4712. PMLR, 2023

  11. [11]

    arXiv preprint arXiv:2202.01009 , year=

    Lénaïc Chizat. Mean-field langevin dynamics: Exponential convergence and annealing.arXiv preprint arXiv:2202.01009, 2022

  12. [12]

    A precise asymptotic analysis of learning diffusion models: theory and insights.arXiv e-prints, pages arXiv–2501, 2025

    Hugo Cui, Cengiz Pehlevan, and Yue M Lu. A precise asymptotic analysis of learning diffusion models: theory and insights.arXiv e-prints, pages arXiv–2501, 2025

  13. [13]

    High-dimensional asymptotics of denoising autoencoders

    Hugo Cui and Lenka Zdeborová. High-dimensional asymptotics of denoising autoencoders. Advances in Neural Information Processing Systems, 36:11850–11890, 2023

  14. [14]

    Neural networks can learn represen- tations with gradient descent

    Alexandru Damian, Jason Lee, and Mahdi Soltanolkotabi. Neural networks can learn represen- tations with gradient descent. InConference on Learning Theory, pages 5413–5452. PMLR, 2022

  15. [15]

    Convergence of denoising diffusion models under the manifold hypothesis

    Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022

  16. [16]

    Diffusion models and the manifold hypothesis: Log-domain smoothing is geometry adaptive

    Tyler Farghly, Peter Potaptchik, Samuel Howard, George Deligiannidis, and Jakiw Pidstrigach. Diffusion models and the manifold hypothesis: Log-domain smoothing is geometry adaptive. arXiv preprint arXiv:2510.02305, 2025

  17. [17]

    Curvature measures.Transactions of the American Mathematical Society, 93(3):418–491, 1959

    Herbert Federer. Curvature measures.Transactions of the American Mathematical Society, 93(3):418–491, 1959

  18. [18]

    Testing the manifold hypothesis

    Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016

  19. [19]

    Flow matching from viewpoint of proximal operators.arXiv preprint arXiv:2602.12683, 2026

    Kenji Fukumizu, Wei Huang, Han Bao, Shuntuo Xu, and Nisha Chandramoothy. Flow matching from viewpoint of proximal operators.arXiv preprint arXiv:2602.12683, 2026

  20. [20]

    Kaiser et al

    Weiguo Gao and Ming Li. How do flow matching models memorize and generalize in sample data subspaces?arXiv preprint arXiv:2410.23594, 2024

  21. [21]

    Asymptotic Learning Curves for Diffusion Models with Random Features Score and Manifold Data

    Anand Jerry George and Nicolas Macris. Asymptotic learning curves for diffusion models with random features score and manifold data.arXiv preprint arXiv:2603.22962, 2026

  22. [22]

    Automatic chemical design using a data-driven continuous representation of molecules.ACS central science, 4(2):268–276, 2018

    Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules.ACS central science, 4(2):268–276, 2018. 12

  23. [23]

    On the feature learning in diffusion models

    Andi Han, Wei Huang, Yuan Cao, and Difan Zou. On the feature learning in diffusion models. arXiv preprint arXiv:2412.01021, 2024

  24. [24]

    Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization

    Yinbin Han, Meisam Razaviyayn, and Renyuan Xu. Neural network-based score estimation in diffusion models: Optimization and generalization.arXiv preprint arXiv:2401.15604, 2024

  25. [25]

    Unified latents (ul): How to train your latents.arXiv preprint arXiv:2602.17270, 2026

    Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, and Tim Salimans. Unified latents (ul): How to train your latents.arXiv preprint arXiv:2602.17270, 2026

  26. [26]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

  27. [27]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  28. [28]

    Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality.Mathematics of Operations Research, 2026

    Zhihan Huang, Yuting Wei, and Yuxin Chen. Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality.Mathematics of Operations Research, 2026

  29. [29]

    Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018

  30. [30]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

  31. [31]

    Self-referencing embedded strings (selfies): A 100% robust molecular string representation

    Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. Self-referencing embedded strings (selfies): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024, 2020

  32. [32]

    Flow Matching is Adaptive to Manifold Structures

    Shivam Kumar, Yixin Wang, and Lizhen Lin. Flow matching is adaptive to manifold structures. arXiv preprint arXiv:2602.22486, 2026

  33. [33]

    Adapting to unknown low-dimensional structures in score-based diffusion models.Advances in Neural Information Processing Systems, 37:126297–126331, 2024

    Gen Li and Yuling Yan. Adapting to unknown low-dimensional structures in score-based diffusion models.Advances in Neural Information Processing Systems, 37:126297–126331, 2024

  34. [34]

    When scores learn geometry: Rate separations under the manifold hypothesis

    Xiang Li, Zebang Shen, Ya-Ping Hsieh, and Niao He. When scores learn geometry: Rate separations under the manifold hypothesis. InThe Fourteenth International Conference on Learning Representations, 2026

  35. [35]

    Understand- ing representation dynamics of diffusion models via low-dimensional modeling.arXiv preprint arXiv:2502.05743, 2025

    Xiao Li, Zekai Zhang, Xiang Li, Siyi Chen, Zhihui Zhu, Peng Wang, and Qing Qu. Understand- ing representation dynamics of diffusion models via low-dimensional modeling.arXiv preprint arXiv:2502.05743, 2025

  36. [36]

    Improving the euclidean diffusion generation of manifold data by mitigating score function singularity.arXiv preprint arXiv:2505.09922, 2025

    Zichen Liu, Wei Zhang, and Tiejun Li. Improving the euclidean diffusion generation of manifold data by mitigating score function singularity.arXiv preprint arXiv:2505.09922, 2025

  37. [37]

    Deep generative models through the lens of the manifold hypothesis: A survey and new connections.arXiv preprint arXiv:2404.02954, 2024

    Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L Caterini, and Jesse C Cresswell. Deep generative models through the lens of the manifold hypothesis: A survey and new connections.arXiv preprint arXiv:2404.02954, 2024

  38. [38]

    A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665– E7671, 2018

    Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665– E7671, 2018

  39. [39]

    [MMM22] Song Mei, Theodor Misiakiewicz, and Andrea Montanari

    Alireza Mousavi-Hosseini, Sejun Park, Manuela Girotti, Ioannis Mitliagkas, and Murat A Erdogdu. Neural networks efficiently learn low-dimensional representations with sgd.arXiv preprint arXiv:2209.14863, 2022

  40. [40]

    Gotta be safe: a new framework for molecular design.Digital Discovery, 3(4):796–804, 2024

    Emmanuel Noutahi, Cristian Gabellini, Michael Craig, Jonathan SC Lim, and Prudencio Tossou. Gotta be safe: a new framework for molecular design.Digital Discovery, 3(4):796–804, 2024. 13

  41. [41]

    Diffusion models are minimax optimal distribution estimators

    Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. InInternational Conference on Machine Learning, pages 26517–26582. PMLR, 2023

  42. [42]

    Score-based generative models detect manifolds.Advances in Neural Information Processing Systems, 35:35852–35865, 2022

    Jakiw Pidstrigach. Score-based generative models detect manifolds.Advances in Neural Information Processing Systems, 35:35852–35865, 2022

  43. [43]

    Approximation theory of the mlp model in neural networks.Acta numerica, 8:143–195, 1999

    Allan Pinkus. Approximation theory of the mlp model in neural networks.Acta numerica, 8:143–195, 1999

  44. [44]

    Linear convergence of diffusion models under the manifold hypothesis.arXiv preprint arXiv:2410.09046, 2024

    Peter Potaptchik, Iskander Azangulov, and George Deligiannidis. Linear convergence of diffusion models under the manifold hypothesis.arXiv preprint arXiv:2410.09046, 2024

  45. [45]

    Fréchet chemnet distance: a metric for generative models for molecules in drug discovery

    Kristina Preuer, Philipp Renz, Thomas Unterthiner, Sepp Hochreiter, and Gunter Klambauer. Fréchet chemnet distance: a metric for generative models for molecules in drug discovery. Journal of chemical information and modeling, 58(9):1736–1741, 2018

  46. [46]

    A de novo molecular generation method using latent vector based generative adversarial network.Journal of cheminformatics, 11(1):74, 2019

    Oleksii Prykhodko, Simon Viet Johansson, Panagiotis-Christos Kotsias, Josep Arús-Pous, Esben Jannik Bjerrum, Ola Engkvist, and Hongming Chen. A de novo molecular generation method using latent vector based generative adversarial network.Journal of cheminformatics, 11(1):74, 2019

  47. [47]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  48. [48]

    Generating focused molecule libraries for drug discovery with recurrent neural networks.ACS central science, 4(1):120–131, 2018

    Marwin HS Segler, Thierry Kogej, Christian Tyrchan, and Mark P Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks.ACS central science, 4(1):120–131, 2018

  49. [49]

    Learning mixtures of gaussians using the ddpm objective.Advances in Neural Information Processing Systems, 36:19636–19649, 2023

    Kulin Shah, Sitan Chen, and Adam Klivans. Learning mixtures of gaussians using the ddpm objective.Advances in Neural Information Processing Systems, 36:19636–19649, 2023

  50. [50]

    Deep unsuper- vised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015

  51. [51]

    Generative modeling by estimating gradients of the data distribution.Advances in neural information processing systems, 32, 2019

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.Advances in neural information processing systems, 32, 2019

  52. [52]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020

  53. [53]

    Diffusion models encode the intrinsic dimension of data manifolds

    Jan Pawel Stanczuk, Georgios Batzolis, Teo Deveney, and Carola-Bibiane Schönlieb. Diffusion models encode the intrinsic dimension of data manifolds. InForty-first International Conference on Machine Learning, 2024

  54. [54]

    Taiji Suzuki, Denny Wu, and Atsushi Nitanda. Convergence of mean-field langevin dynamics: time-space discretization, stochastic gradient, and variance reduction.Advances in Neural Information Processing Systems, 36:15545–15577, 2023

  55. [55]

    Adaptivity of diffusion models to manifold structures

    Rong Tang and Yun Yang. Adaptivity of diffusion models to manifold structures. InInternational conference on artificial intelligence and statistics, pages 1648–1656. PMLR, 2024

  56. [56]

    Score-based generative modeling in latent space

    Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in neural information processing systems, 34:11287–11302, 2021

  57. [57]

    A connection between score matching and denoising autoencoders

    Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011

  58. [58]

    High-dimensional statistics: A non-asymptotic viewpoint

    Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

  59. [59]

    An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models

    Binxu Wang and Cengiz Pehlevan. An analytical theory of spectral bias in the learning dynamics of diffusion models. arXiv preprint arXiv:2503.03206, 2025

  60. [60]

    Diffusion models generate images like painters: an analytical theory of outline first, details later

    Binxu Wang and John J Vastola. Diffusion models generate images like painters: an analytical theory of outline first, details later. arXiv preprint arXiv:2303.02490, 2023

  61. [61]

    Diffusion models learn low-dimensional distributions via subspace clustering

    Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, and Qing Qu. Diffusion models learn low-dimensional distributions via subspace clustering. In 2025 IEEE 10th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 211–215. IEEE, 2025

  62. [62]

    cmolgpt: a conditional generative pre-trained transformer for target-specific de novo molecular generation

    Ye Wang, Honggang Zhao, Simone Sciabola, and Wenlu Wang. cmolgpt: a conditional generative pre-trained transformer for target-specific de novo molecular generation. Molecules, 28(11):4430, 2023

  63. [63]

    When diffusion models memorize: Inductive biases in probability flow of minimum-norm shallow neural nets

    Chen Zeno, Hila Manor, Greg Ongie, Nir Weinberger, Tomer Michaeli, and Daniel Soudry. When diffusion models memorize: Inductive biases in probability flow of minimum-norm shallow neural nets. arXiv preprint arXiv:2506.19031, 2025

  64. [64]

    Analyzing neural network-based generative diffusion models through convex optimization

    Fangzhao Zhang and Mert Pilanci. Analyzing neural network-based generative diffusion models through convex optimization. arXiv preprint arXiv:2402.01965, 2024

  65. [65]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

    A Limitations

    Our work is deliberately scoped to characterize the training dynamics of score matchin...

  66. [66]

    (Kernel regularity.) The kernel $K$ belongs to $C^{r}(M \times M)$, hence to $H^{r-k/2}(M \times M)$ by Sobolev embedding

  67. [67]

    The polynomial decay rate $j^{-r/k}$ depends on the intrinsic dimension $k$ rather than $d$, which is the key to controlling the ambient dependence in Stage 2

    (Eigenvalue decay.) The eigenvalues $\{\lambda_j\}_{j \ge 1}$ of the induced integral operator $T_K : L^2(M) \to L^2(M)$ satisfy $\lambda_j \le C_{M,r}\,\|K\|_{C^r}\, j^{-r/k}$ (38), where $C_{M,r}$ depends only on $(M, g)$ and $r$, not on the ambient dimension $d$. The polynomial decay rate $j^{-r/k}$ depends on the intrinsic dimension $k$ rather than $d$, which is the key to controlling the ambient dependence in Stage 2. Proof. B...
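
    The ambient-dimension independence in the eigenvalue bound can be checked numerically on a toy case. The sketch below is not the paper's construction. It assumes a unit circle (intrinsic dimension k = 1) embedded isometrically in R^d, a Gaussian kernel with a hypothetical bandwidth of 0.5, and the Gram matrix divided by n as a proxy for the spectrum of the integral operator. The helper names `circle_in_ambient` and `operator_eigenvalues` are illustrative.

```python
import numpy as np

def circle_in_ambient(n, d, rng):
    """n equispaced points on the unit circle (k = 1), isometrically embedded in R^d."""
    theta = 2.0 * np.pi * np.arange(n) / n
    planar = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # shape (n, 2)
    # Random d x 2 matrix with orthonormal columns: a linear isometric embedding.
    q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
    return planar @ q.T  # shape (n, d)

def operator_eigenvalues(points, bandwidth=0.5):
    """Descending eigenvalues of the Gaussian Gram matrix divided by n.

    This is an empirical proxy for the spectrum of the integral operator;
    the Gaussian kernel and the bandwidth are assumptions of this sketch.
    """
    sq_norms = np.sum(points**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    gram = np.exp(-sq_dists / (2.0 * bandwidth**2))
    return np.linalg.eigvalsh(gram / len(points))[::-1]

rng = np.random.default_rng(0)
eigs_low = operator_eigenvalues(circle_in_ambient(200, 3, rng))    # d = 3
eigs_high = operator_eigenvalues(circle_in_ambient(200, 50, rng))  # d = 50

# Largest gap between the two spectra; it should sit at round-off level.
print(np.max(np.abs(eigs_low - eigs_high)))
```

    Because the embedding preserves pairwise distances, the two spectra agree up to floating-point error, which is the toy analogue of the constant not depending on d. The Gaussian kernel is smooth, so its eigenvalues decay faster than any polynomial rate; a kernel with only finitely many derivatives would be needed to observe a rate close to the bound itself.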

  68. [68]

    SiLD consistently outperforms LDM-CNN on reconstruction MSE across all settings, with the gap present at the smaller network (0.00440 vs

    Table 4 reports results across two model sizes and two training budgets. SiLD consistently outperforms LDM-CNN on reconstruction MSE across all settings, with the gap present at the smaller network (0.00440 vs. 0.00503) and persisting at the larger network (0.00345 vs. 0.00396). At 10× training, both methods converge to near-identical reconstruction MSE (...
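
    To put the quoted reconstruction MSE values on a common scale, the relative gap can be computed directly. This is only arithmetic on the four numbers quoted above. It assumes the first value in each pair is SiLD and the second is LDM-CNN, as the sentence order suggests, and the dictionary keys are illustrative labels.

```python
# Reconstruction MSE pairs quoted from the Table 4 discussion.
# Assumed order in each pair: (SiLD, LDM-CNN).
reported_mse = {
    "smaller network": (0.00440, 0.00503),
    "larger network": (0.00345, 0.00396),
}

def relative_gap(sild_mse, ldm_mse):
    """Fraction by which the SiLD MSE is lower than the LDM-CNN MSE."""
    return (ldm_mse - sild_mse) / ldm_mse

relative_gaps = {name: relative_gap(*pair) for name, pair in reported_mse.items()}

for name, gap in relative_gaps.items():
    print(f"{name}: SiLD MSE is {100 * gap:.1f}% lower than LDM-CNN")
```

    Under that reading, the gap is roughly 12.5% at the smaller network and 12.9% at the larger one, so the relative advantage is about the same at both model sizes.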