A theory of learning data statistics in diffusion models, from easy to hard
Pith reviewed 2026-05-21 11:28 UTC · model grok-4.3
The pith
Diffusion models learn pairwise input statistics at linear sample complexity before they learn higher-order correlations such as the fourth cumulant, which requires at least cubic sample complexity unless a shared latent structure ties it to the pairwise statistics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the mixed cumulant model the denoiser learns pairwise statistics of the inputs at linear sample complexity while fourth-order cumulants require at least cubic sample complexity; the sample complexity of the fourth cumulant falls back to linear once pairwise and higher-order statistics are tied together by a shared latent structure. The governing quantity is a scalar invariant of the model called the diffusion information exponent.
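As a compact restatement of the three regimes, the block below writes the claim as scalings. The symbols are assumptions introduced here for illustration: $n$ is the number of training samples and $d$ the input dimension, so "linear" is read as $n \propto d$ and "cubic" as $n \propto d^{3}$. It is a reading aid, not the paper's theorem statement.

```latex
\begin{aligned}
\text{pairwise statistics:} \quad & n \propto d \\
\text{fourth cumulant, independent latents:} \quad & n \gtrsim d^{3} \\
\text{fourth cumulant, correlated latents:} \quad & n \propto d
\end{aligned}
```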
What carries the argument
The diffusion information exponent, a scalar invariant extracted from the mixed cumulant model that sets the sample complexity required to recover statistics of a given order.
If this is right
- Pairwise correlations are recovered first during training, producing the distributional simplicity bias seen on natural images.
- Fourth-order cumulants are recovered only after the linear regime unless they share latent factors with the pairwise terms.
- The diffusion information exponent directly predicts the sample threshold at which each order of statistic becomes learnable.
- The staged acquisition of statistics offers a mechanism for how diffusion models build distributions of rising complexity.
Where Pith is reading between the lines
- The same exponent may govern learning order in other score-based or flow-based generative models that rely on denoising.
- Synthetic datasets with tunable latent correlations could be used to test whether real image statistics follow the predicted linear-to-cubic transition.
- If the exponent can be estimated from data, it might guide choices of training schedule or model capacity to accelerate acquisition of higher-order features.
Load-bearing premise
The mixed cumulant model is a faithful minimal representation of the statistical structure present in natural images.
What would settle it
Train a small denoiser on samples from the mixed cumulant model with controlled latent correlation between second- and fourth-order terms and measure whether the number of samples needed to recover the fourth cumulant scales linearly or cubically.
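A minimal sketch of the data side of such an experiment is given below, under stated assumptions. It assumes the additive form x = sqrt(beta_u) * lambda * u + sqrt(beta_v) * nu * v + z with a Gaussian latent lambda, a Rademacher latent nu, and Gaussian noise z. The function names, the parameter `rho`, and the sign-tying rule used to correlate the two latents are hypothetical choices made for illustration; the paper's exact construction (for example, any whitening of the non-Gaussian direction) is not reproduced here, and no denoiser is trained.

```python
import numpy as np


def sample_mixed_cumulant(n, d, beta_u=1.0, beta_v=1.0, rho=0.0, seed=0):
    """Draw n samples x = sqrt(beta_u)*lam*u + sqrt(beta_v)*nu*v + z (illustrative form)."""
    rng = np.random.default_rng(seed)
    # Two orthonormal directions: u carries the Gaussian (pairwise) signal,
    # v carries the non-Gaussian (higher-order) signal.
    q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
    u, v = q[:, 0], q[:, 1]
    lam = rng.standard_normal(n)          # Gaussian latent
    nu = rng.choice([-1.0, 1.0], size=n)  # Rademacher latent (non-Gaussian)
    # Hypothetical coupling: with probability rho, nu copies the sign of lam.
    # nu stays Rademacher marginally, but becomes correlated with lam.
    tie = rng.random(n) < rho
    nu = np.where(tie, np.sign(lam), nu)
    z = rng.standard_normal((n, d))       # isotropic Gaussian noise
    x = np.sqrt(beta_u) * lam[:, None] * u + np.sqrt(beta_v) * nu[:, None] * v + z
    return x, u, v


def excess_kurtosis_along(x, w):
    """Empirical excess kurtosis (normalized fourth cumulant) of the projection x @ w."""
    s = x @ w
    s = s - s.mean()
    var = np.mean(s**2)
    return np.mean(s**4) / var**2 - 3.0


if __name__ == "__main__":
    for rho in (0.0, 1.0):
        x, u, v = sample_mixed_cumulant(n=20000, d=20, rho=rho, seed=1)
        cross = np.mean((x @ u) * (x @ v))  # latent correlation seen at second order
        print(rho, excess_kurtosis_along(x, u), excess_kurtosis_along(x, v), cross)
```

With `rho = 0` the two directions are uncorrelated and only the projection onto `v` shows a non-zero fourth cumulant; with `rho > 0` the cross-moment between the projections becomes non-zero, which is the kind of shared latent structure the test would vary. Recovery would then be measured as the overlap between a trained denoiser's weights and `v` as a function of `n`.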
Original abstract
While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue first by empirically showing that standard diffusion models trained on natural images exhibit a distributional simplicity bias, learning simple, pair-wise input statistics before specializing to higher-order correlations. We reproduce this behaviour in simple denoisers trained on a minimal data model, the mixed cumulant model, where we precisely control both pair-wise and higher-order correlations of the inputs. We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call the diffusion information exponent, in analogy to related invariants in different learning paradigms. Using this invariant, we prove that the denoiser learns simple, pair-wise statistics of the inputs at linear sample complexity, while more complex higher-order statistics, such as the fourth cumulant, require at least cubic sample complexity. We also prove that the sample complexity of learning the fourth cumulant is linear if pair-wise and higher-order statistics share a correlated latent structure. Our work describes a key mechanism for how diffusion models can learn distributions of increasing complexity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically demonstrates a distributional simplicity bias in diffusion models trained on natural images, where pair-wise statistics are learned before higher-order correlations. To explain this, the authors introduce a mixed cumulant model that allows control over pair-wise and higher-order correlations, and define a scalar invariant, the diffusion information exponent, that governs sample complexity. They claim to prove three results: the denoiser learns pair-wise statistics at linear sample complexity; higher-order statistics such as the fourth cumulant require at least cubic sample complexity; and the fourth cumulant becomes learnable at linear sample complexity when pair-wise and higher-order statistics share a correlated latent structure. This is positioned as a mechanism for how diffusion models learn distributions of increasing complexity.
Significance. If the proofs are complete and gap-free and the mixed cumulant model faithfully captures the relevant statistical structures driving the bias in natural images, this work would provide a valuable theoretical account of simplicity biases in diffusion models via an information-theoretic invariant. It draws an analogy to similar exponents in other learning settings and offers a controlled setting in which to analyze the denoiser objective, which could inform analyses of generative model training dynamics.
major comments (2)
- [Mixed cumulant model and diffusion information exponent] The diffusion information exponent is introduced as a scalar invariant of the mixed cumulant model that governs the sample complexities. It must be verified that the definition of this exponent (and the associated reduction from the denoiser objective) is independent of the target linear/cubic bounds rather than constructed to produce them, to avoid any risk of circularity in the central claims.
- [Proofs for sample complexity bounds] The abstract states that proofs exist for the linear and cubic bounds and for the latent-structure case. These derivations should be examined in detail for gaps, particularly in how the information exponent is applied to bound the sample complexity of learning the fourth cumulant from the denoiser objective.
minor comments (1)
- [Empirical results] The empirical reproduction of the simplicity bias in simple denoisers on the mixed cumulant model is useful; additional details on network architectures, training hyperparameters, and quantitative metrics used to measure when statistics are learned would strengthen reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and detailed report. Their comments highlight important points about the foundations of our central claims, and we have used this feedback to strengthen the presentation of the mixed cumulant model and the associated proofs. We address each major comment below.
Point-by-point responses
- Referee: [Mixed cumulant model and diffusion information exponent] The diffusion information exponent is introduced as a scalar invariant of the mixed cumulant model that governs the sample complexities. It must be verified that the definition of this exponent (and the associated reduction from the denoiser objective) is independent of the target linear/cubic bounds rather than constructed to produce them, to avoid any risk of circularity in the central claims.
Authors: We appreciate the referee's concern and agree that circularity must be avoided. The diffusion information exponent is defined directly from the mixed cumulant model prior to any sample-complexity analysis: it is the infimum of α > 0 such that the fourth-order cumulant tensor is controlled by the pairwise covariance raised to the power α/2 in the appropriate tensor norm (see Definition 3.2). This definition is motivated by the structure of the data-generating process and draws an explicit analogy to information exponents appearing in other learning settings (e.g., tensor PCA and planted problems). The reduction from the denoiser objective to cumulant estimation follows from the explicit form of the score function under the mixed cumulant model and holds for any α; it does not presuppose the values 1 or 3. Only after these steps do we instantiate the model with independent versus correlated latents and obtain the concrete exponents α = 1 (pairwise) and α = 3 (fourth cumulant). We have added a new clarifying paragraph immediately after Definition 3.2 that states this logical order explicitly.
Revision: partial
- Referee: [Proofs for sample complexity bounds] The abstract states that proofs exist for the linear and cubic bounds and for the latent-structure case. These derivations should be examined in detail for gaps, particularly in how the information exponent is applied to bound the sample complexity of learning the fourth cumulant from the denoiser objective.
Authors: We have re-examined the proofs in the appendix. The argument proceeds by (i) showing that the population denoiser recovers the relevant cumulants exactly, (ii) bounding the deviation of the empirical denoiser from the population one via a Lipschitz property of the diffusion loss, and (iii) applying a matrix/tensor concentration inequality whose rate is governed by the diffusion information exponent α. For the fourth cumulant without latent correlation, α = 3 yields the cubic sample-complexity lower and upper bounds. When pairwise and higher-order statistics share a latent factor, the effective exponent collapses to 1, recovering linear complexity. The steps are spelled out in Lemmas B.3–B.7 and Theorem 4.3. To make the application of the exponent more transparent, we have inserted an expanded proof sketch in Section 4.2 and added an intermediate lemma that isolates the role of α. We believe the derivations contain no gaps, but we welcome further scrutiny of the revised appendix.
Revision: yes
Circularity Check
Derivation self-contained; diffusion information exponent derived independently and used in genuine proofs
Full rationale
The paper first empirically observes distributional simplicity bias in diffusion models on natural images, then constructs a minimal mixed cumulant model to control pairwise and higher-order statistics. It identifies the diffusion information exponent as a scalar invariant of this model (in analogy to other learning paradigms) and deploys it to prove linear sample complexity for pairwise statistics and cubic for the fourth cumulant (linear under correlated latents). These proofs are presented as mathematical derivations from the model's structure rather than tautological re-statements of fitted parameters or self-referential definitions. No load-bearing step reduces by construction to the target result; the central claims rest on explicit proofs whose assumptions are stated separately from the conclusions. The extrapolation to natural images is framed as an assumption rather than a derived necessity, keeping the internal derivation chain non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The mixed cumulant model is a faithful minimal representation of the statistical structure present in natural images.
invented entities (1)
- diffusion information exponent: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem:
We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call the diffusion information exponent... $k^\star$... $\alpha_{\tau+1} = \alpha_\tau + \frac{\eta}{d}\, c^{L}_{k^\star}\, c^{F}_{k^\star-1}\, \alpha_\tau^{k^\star-1} + O(\alpha_\tau^{k^\star})$
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (tag: unclear)
Relation between the paper passage and the cited Recognition theorem:
mixed cumulant model... $x^\mu = \sqrt{\beta_u}\,\lambda^\mu u + \sqrt{\beta_v}\,\nu^\mu v + z^\mu$... recovery of the cumulant spike $v$... $k^\star = 4$... cubic sample complexity
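The overlap recursion quoted in the first entry above can be iterated numerically to see why a larger exponent slows the escape from a random start. The sketch below keeps only the deterministic leading term: it drops the O(α^k⋆) remainder and all noise, and merges the two constants into a single hypothetical constant `c`. The function name, the default step size, and the escape threshold are assumptions. Because noise and step-size constraints are ignored, the step counts illustrate the ordering of regimes only and do not reproduce the sample-complexity exponents proved in the paper.

```python
import math


def steps_to_escape(d, k_star, eta=0.1, c=1.0, target=0.5, max_steps=10**7):
    """Iterate alpha <- alpha + (eta/d) * c * alpha**(k_star - 1) until alpha >= target.

    Returns the number of updates taken, or None if max_steps is exhausted.
    """
    # A random unit vector in d dimensions has overlap of order 1/sqrt(d)
    # with any fixed direction, so start the overlap there.
    alpha = 1.0 / math.sqrt(d)
    for step in range(max_steps):
        if alpha >= target:
            return step
        alpha += (eta / d) * c * alpha ** (k_star - 1)
    return None


if __name__ == "__main__":
    for k_star in (1, 2, 4):
        print(k_star, steps_to_escape(d=100, k_star=k_star))
```

For k⋆ = 1 the increment does not depend on the current overlap, so progress is steady from the start. For k⋆ = 4 the increment is proportional to α³, which is tiny at initialization, so most of the updates are spent near the starting point.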
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Distributional simplicity bias and effective convexity in Energy Based Models
Gradient flow in energy-based models for strictly positive binary distributions produces stable data-consistent fixed points and a learning hierarchy that favors lower-order interactions first, mechanistically explain...
- Understanding diffusion models requires rethinking (again) generalization
Diffusion models require new generalization frameworks because memorization and novel generation are incompatible, so research should focus on what models learn before memorization begins.