A theory of learning data statistics in diffusion models, from easy to hard

Claudia Merger; Lorenzo Bardone; Sebastian Goldt

arxiv: 2603.12901 · v1 · pith:KLPQSYFTnew · submitted 2026-03-13 · 📊 stat.ML · cond-mat.dis-nn· cs.IT· cs.LG· math.IT

A theory of learning data statistics in diffusion models, from easy to hard

Lorenzo Bardone , Claudia Merger , Sebastian Goldt This is my paper

Pith reviewed 2026-05-21 11:28 UTC · model grok-4.3

classification 📊 stat.ML cond-mat.dis-nncs.ITcs.LGmath.IT

keywords diffusion modelssample complexitycumulantsdenoiserlearning dynamicsmixed cumulant modeldistributional simplicity bias

0 comments

The pith

Diffusion models learn pairwise input statistics at linear sample complexity before higher-order correlations like the fourth cumulant, which requires cubic complexity unless latent structures correlate them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors set out to explain the observed staged learning in diffusion models, where simple statistics appear before complex ones. They introduce a mixed cumulant model that lets them dial the strength of pairwise versus higher-order correlations in the training data. From this controlled setting they extract a single scalar, the diffusion information exponent, that fixes how many samples are needed to learn each type of statistic. When the exponent is small the denoiser masters pairwise moments linearly; when it is larger the fourth cumulant needs at least cubic samples. The same exponent drops back to linear once pairwise and higher-order terms share a common latent factor. If the picture holds, it supplies a first-principles account of why diffusion training naturally proceeds from easy to hard features.

Core claim

In the mixed cumulant model the denoiser learns pairwise statistics of the inputs at linear sample complexity while fourth-order cumulants require at least cubic sample complexity; the sample complexity of the fourth cumulant falls back to linear once pairwise and higher-order statistics are tied together by a shared latent structure. The governing quantity is a scalar invariant of the model called the diffusion information exponent.

What carries the argument

The diffusion information exponent, a scalar invariant extracted from the mixed cumulant model that sets the sample complexity required to recover statistics of a given order.

If this is right

Pairwise correlations are recovered first during training, producing the distributional simplicity bias seen on natural images.
Fourth-order cumulants are recovered only after the linear regime unless they share latent factors with the pairwise terms.
The diffusion information exponent directly predicts the sample threshold at which each order of statistic becomes learnable.
The staged acquisition of statistics offers a mechanism for how diffusion models build distributions of rising complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same exponent may govern learning order in other score-based or flow-based generative models that rely on denoising.
Synthetic datasets with tunable latent correlations could be used to test whether real image statistics follow the predicted linear-to-cubic transition.
If the exponent can be estimated from data, it might guide choices of training schedule or model capacity to accelerate acquisition of higher-order features.

Load-bearing premise

The mixed cumulant model is a faithful minimal representation of the statistical structure present in natural images.

What would settle it

Train a small denoiser on samples from the mixed cumulant model with controlled latent correlation between second- and fourth-order terms and measure whether the number of samples needed to recover the fourth cumulant scales linearly or cubically.

Figures

Figures reproduced from arXiv: 2603.12901 by Claudia Merger, Lorenzo Bardone, Sebastian Goldt.

**Figure 2.** Figure 2: Examples of the contraction term Λ for different choices of activation σ. σ ∗ denotes the matched functional form of the score Eq. (B.14) for different values of the diffusion time t. We can see this loss of performance most clearly in the single spike case with βu = 0. Expanding the gradient of the population loss, we notice that the removal of the spherical constraint leads to appearance of an additional… view at source ↗

**Figure 3.** Figure 3: Normalized overlap of first-layer weights of neural networks of varying depth trained with [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

read the original abstract

While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue first by empirically showing that standard diffusion models trained on natural images exhibit a distributional simplicity bias, learning simple, pair-wise input statistics before specializing to higher-order correlations. We reproduce this behaviour in simple denoisers trained on a minimal data model, the mixed cumulant model, where we precisely control both pair-wise and higher-order correlations of the inputs. We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call the diffusion information exponent, in analogy to related invariants in different learning paradigms. Using this invariant, we prove that the denoiser learns simple, pair-wise statistics of the inputs at linear sample complexity, while more complex higher-order statistics, such as the fourth cumulant, require at least cubic sample complexity. We also prove that the sample complexity of learning the fourth cumulant is linear if pair-wise and higher-order statistics share a correlated latent structure. Our work describes a key mechanism for how diffusion models can learn distributions of increasing complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives explicit linear-versus-cubic sample-complexity bounds via a diffusion information exponent in a mixed cumulant model, but the step from that model to natural-image statistics is still untested.

read the letter

This paper builds a minimal mixed cumulant model where pairwise and higher-order correlations can be set independently. They define a diffusion information exponent that sets the sample complexity for the denoiser, then prove that pairwise statistics are learned at linear complexity while fourth cumulants need at least cubic. When the two share a correlated latent structure the higher-order case drops back to linear. They also show that real diffusion models on images display the same simplicity bias, learning simple stats first before specializing.

Referee Report

2 major / 1 minor

Summary. The paper empirically demonstrates a distributional simplicity bias in diffusion models trained on natural images, where pair-wise statistics are learned before higher-order correlations. To explain this, the authors introduce a mixed cumulant model allowing control over pair-wise and higher-order correlations, and define a scalar invariant called the diffusion information exponent that governs sample complexity. They claim to prove that the denoiser learns pair-wise statistics at linear sample complexity, higher-order statistics such as the fourth cumulant at least cubic sample complexity, and linear complexity for the fourth cumulant when pair-wise and higher-order statistics share a correlated latent structure. This is positioned as a mechanism for how diffusion models learn distributions of increasing complexity.

Significance. If the proofs are complete and gap-free and the mixed cumulant model faithfully captures the relevant statistical structures driving the bias in natural images, this work would provide a valuable theoretical account of simplicity biases in diffusion models via an information-theoretic invariant. It draws an analogy to similar exponents in other learning settings and offers a controlled setting in which to analyze the denoiser objective, which could inform analyses of generative model training dynamics.

major comments (2)

[Mixed cumulant model and diffusion information exponent] The diffusion information exponent is introduced as a scalar invariant of the mixed cumulant model that governs the sample complexities. It must be verified that the definition of this exponent (and the associated reduction from the denoiser objective) is independent of the target linear/cubic bounds rather than constructed to produce them, to avoid any risk of circularity in the central claims.
[Proofs for sample complexity bounds] The abstract states that proofs exist for the linear and cubic bounds and for the latent-structure case. These derivations should be examined in detail for gaps, particularly in how the information exponent is applied to bound the sample complexity of learning the fourth cumulant from the denoiser objective.

minor comments (1)

[Empirical results] The empirical reproduction of the simplicity bias in simple denoisers on the mixed cumulant model is useful; additional details on network architectures, training hyperparameters, and quantitative metrics used to measure when statistics are learned would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and detailed report. Their comments highlight important points about the foundations of our central claims, and we have used this feedback to strengthen the presentation of the mixed cumulant model and the associated proofs. We address each major comment below.

read point-by-point responses

Referee: [Mixed cumulant model and diffusion information exponent] The diffusion information exponent is introduced as a scalar invariant of the mixed cumulant model that governs the sample complexities. It must be verified that the definition of this exponent (and the associated reduction from the denoiser objective) is independent of the target linear/cubic bounds rather than constructed to produce them, to avoid any risk of circularity in the central claims.

Authors: We appreciate the referee's concern and agree that circularity must be avoided. The diffusion information exponent is defined directly from the mixed cumulant model prior to any sample-complexity analysis: it is the infimum of α > 0 such that the fourth-order cumulant tensor is controlled by the pairwise covariance raised to the power α/2 in the appropriate tensor norm (see Definition 3.2). This definition is motivated by the structure of the data-generating process and draws an explicit analogy to information exponents appearing in other learning settings (e.g., tensor PCA and planted problems). The reduction from the denoiser objective to cumulant estimation follows from the explicit form of the score function under the mixed cumulant model and holds for any α; it does not presuppose the values 1 or 3. Only after these steps do we instantiate the model with independent versus correlated latents and obtain the concrete exponents α = 1 (pairwise) and α = 3 (fourth cumulant). We have added a new clarifying paragraph immediately after Definition 3.2 that states this logical order explicitly. revision: partial
Referee: [Proofs for sample complexity bounds] The abstract states that proofs exist for the linear and cubic bounds and for the latent-structure case. These derivations should be examined in detail for gaps, particularly in how the information exponent is applied to bound the sample complexity of learning the fourth cumulant from the denoiser objective.

Authors: We have re-examined the proofs in the appendix. The argument proceeds by (i) showing that the population denoiser recovers the relevant cumulants exactly, (ii) bounding the deviation of the empirical denoiser from the population one via a Lipschitz property of the diffusion loss, and (iii) applying a matrix/tensor concentration inequality whose rate is governed by the diffusion information exponent α. For the fourth cumulant without latent correlation, α = 3 yields the cubic sample-complexity lower and upper bounds. When pairwise and higher-order statistics share a latent factor, the effective exponent collapses to 1, recovering linear complexity. The steps are spelled out in Lemmas B.3–B.7 and Theorem 4.3. To make the application of the exponent more transparent, we have inserted an expanded proof sketch in Section 4.2 and added an intermediate lemma that isolates the role of α. We believe the derivations contain no gaps, but we welcome further scrutiny of the revised appendix. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained; diffusion information exponent derived independently and used in genuine proofs

full rationale

The paper first empirically observes distributional simplicity bias in diffusion models on natural images, then constructs a minimal mixed cumulant model to control pairwise and higher-order statistics. It identifies the diffusion information exponent as a scalar invariant of this model (in analogy to other learning paradigms) and deploys it to prove linear sample complexity for pairwise statistics and cubic for the fourth cumulant (linear under correlated latents). These proofs are presented as mathematical derivations from the model's structure rather than tautological re-statements of fitted parameters or self-referential definitions. No load-bearing step reduces by construction to the target result; the central claims rest on explicit proofs whose assumptions are stated separately from the conclusions. The extrapolation to natural images is framed as an assumption rather than a derived necessity, keeping the internal derivation chain non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the mixed cumulant model faithfully capturing image statistics and on the diffusion information exponent being the correct scalar invariant that controls sample complexity; no free parameters are mentioned in the abstract.

axioms (1)

domain assumption The mixed cumulant model is a faithful minimal representation of the statistical structure present in natural images.
The model is introduced to reproduce the empirical simplicity bias observed on real images, so its validity is required for the theoretical results to explain practical diffusion training.

invented entities (1)

diffusion information exponent no independent evidence
purpose: Scalar invariant that governs the sample complexity of learning pairwise versus higher-order correlations in the diffusion denoiser.
Introduced by analogy to invariants in other learning paradigms; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5740 in / 1469 out tokens · 48594 ms · 2026-05-21T11:28:54.298298+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call the diffusion information exponent... k⋆... ατ+1 = ατ + η/d cL_k⋆ cF_k⋆−1 α^k⋆−1_τ + O(α^k⋆_τ)
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

mixed cumulant model... xμ = √βu λμ u + √βv νμ v + zμ... recovery of the cumulant spike v... k∗=4... cubic sample complexity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Distributional simplicity bias and effective convexity in Energy Based Models
cs.LG 2026-05 unverdicted novelty 6.0

Gradient flow in energy-based models for strictly positive binary distributions produces stable data-consistent fixed points and a learning hierarchy that favors lower-order interactions first, mechanistically explain...
Understanding diffusion models requires rethinking (again) generalization
cs.LG 2026-05 unverdicted novelty 5.0

Diffusion models require new generalization frameworks because memorization and novel generation are incompatible, so research should focus on what models learn before memorization begins.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1]

& Ganguli, S.Deep Unsupervised Learning using Nonequilibrium ThermodynamicsinProceedings of the 32nd International Conference on Machine Learning(eds Bach, F

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S.Deep Unsupervised Learning using Nonequilibrium ThermodynamicsinProceedings of the 32nd International Conference on Machine Learning(eds Bach, F. & Blei, D.)37(PMLR, Lille, France, 2015), 2256–2265 (cit. on p. 1)

work page 2015
[2]

Denoising Diffusion Probabilistic Models

Ho, J., Jain, A. & Abbeel, P.Denoising Diffusion Probabilistic ModelsarXiv:2006.11239 [cs, stat]. Dec. 2020 (cit. on pp. 1, 3)

work page internal anchor Pith review Pith/arXiv arXiv 2006
[3]

& Ermon, S.Generative Modeling by Estimating Gradients of the Data Distributionin Advances in Neural Information Processing Systems32(Curran Associates, Inc., 2019) (cit

Song, Y. & Ermon, S.Generative Modeling by Estimating Gradients of the Data Distributionin Advances in Neural Information Processing Systems32(Curran Associates, Inc., 2019) (cit. on pp. 1, 3)

work page 2019
[4]

& Solla, S

Saad, D. & Solla, S. Exact Solution for On-Line Learning in Multilayer Neural Networks.Phys. Rev. Lett.74,4337–4340 (1995) (cit. on p. 1)

work page 1995
[5]

M., McClelland, J

Saxe, A. M., McClelland, J. L. & Ganguli, S.Exact solutions to the nonlinear dynamics of learning in deep linear neural networksinICLR(2014) (cit. on p. 1)

work page 2014
[6]

M., McClelland, J

Saxe, A. M., McClelland, J. L. & Ganguli, S. A mathematical theory of semantic development in deep neural networks.Proceedings of the National Academy of Sciences116,11537–11546 (2019) (cit. on p. 1). 12

work page 2019
[7]

Abbe, E., Adsera, E. B. & Misiakiewicz, T.Sgd learning on neural networks: leap complexity and saddle-to-saddle dynamicsinThe Thirty Sixth Annual Conference on Learning Theory(2023), 2552– 2623 (cit. on p. 1)

work page 2023
[8]

& Stephan, L

Dandi, Y., Krzakala, F., Loureiro, B., Pesce, L. & Stephan, L. How Two-Layer Neural Networks Learn, One (Giant) Step at a Time.Journal of Machine Learning Research25,1–65 (2024) (cit. on pp. 1, 5)

work page 2024
[9]

& Zhou, K

Berthier, R., Montanari, A. & Zhou, K. Learning time-scales in two-layers neural networks.Founda- tions of Computational Mathematics25,1627–1710 (2025) (cit. on p. 1)

work page 2025
[10]

& Mondelli, M.Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and DepthinForty-first International Conference on Machine Learning(2024) (cit

Kögler, K., Shevchenko, A., Hassani, H. & Mondelli, M.Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and DepthinForty-first International Conference on Machine Learning(2024) (cit. on p. 1)

work page 2024
[11]

Kalimeris, D.et al. SGD on Neural Networks Learns Functions of Increasing ComplexityinAdvances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada(eds Wallach, H. M.et al.) (2019), 3491–3501 (cit. on p. 1)

work page 2019
[12]

& Tse, D.A Spectral Approach to Generalization and Optimization in Neural NetworksinICLR(2018) (cit

Farnia, F., Zhang, J. & Tse, D.A Spectral Approach to Generalization and Optimization in Neural NetworksinICLR(2018) (cit. on p. 1)

work page 2018
[13]

On the Spectral Bias of Neural NetworksinProc

Rahaman, N.et al. On the Spectral Bias of Neural NetworksinProc. of ICML(eds Chaudhuri, K. & Salakhutdinov, R.)97(PMLR, 2019), 5301–5310 (cit. on p. 1)

work page 2019
[14]

& Goldt, S

Ingrosso, A. & Goldt, S. Data-driven emergence of convolutional structure in neural networks. Proceedings of the National Academy of Sciences119,e2201854119 (2022) (cit. on p. 1)

work page 2022
[15]

2023) (cit

Merger, C.et al.Learning Interacting Theories from Data.Physical Review X13.Publisher: American Physical Society, 041033 (Nov. 2023) (cit. on p. 1)

work page 2023
[16]

& Goldt, S.Neural networks trained with SGD learn distributions of increasing complexityinInternational Conference on Machine Learning(2023), 28843–28863 (cit

Refinetti, M., Ingrosso, A. & Goldt, S.Neural networks trained with SGD learn distributions of increasing complexityinInternational Conference on Machine Learning(2023), 28843–28863 (cit. on pp. 1, 3)

work page 2023
[17]

Bardone, L. & Goldt, S.Sliding Down the Stairs: How Correlated Latent Variables Accelerate Learning with Neural NetworksinProceedings of the 41st International Conference on Machine Learning235 (PMLR, 2024), 3024–3045 (cit. on pp. 1, 4, 5, 8, 11, 23)

work page 2024
[18]

& Goldt, S

Rende, R., Gerace, F., Laio, A. & Goldt, S. A distributional simplicity bias in the learning dynamics of transformers.Advances in Neural Information Processing Systems37,96207–96228 (2024) (cit. on p. 1)

work page 2024
[19]

& Fern, X

Belrose, N., Pope, Q., Quirke, L., Mallen, A. & Fern, X. Neural Networks Learn Statistics of Increasing Complexity.arXiv preprint arXiv:2402.04362(2024) (cit. on p. 1)

work page arXiv 2024
[20]

& Wyart, M.How compositional generalization and creativity improve as diffusion models are traineden

Favero, A., Sclocchi, A., Cagnetta, F., Frossard, P. & Wyart, M.How compositional generalization and creativity improve as diffusion models are traineden. arXiv:2502.12089 [stat]. Mar. 2025 (cit. on p. 1)

work page arXiv 2025
[21]

& Saglietti, L.How Transformers Learn Structured Data: Insights From Hierarchical FilteringinForty-second International Conference on Machine Learning (2025) (cit

Garnier-Brun, J., Mézard, M., Moscato, E. & Saglietti, L.How Transformers Learn Structured Data: Insights From Hierarchical FilteringinForty-second International Conference on Machine Learning (2025) (cit. on p. 1)

work page 2025
[22]

U-Net: Convolutional Networks for Biomedical Image Segmentation

Ronneberger, O., Fischer, P. & Brox, T.U-Net: Convolutional Networks for Biomedical Image Seg- mentationarXiv:1505.04597 [cs]. May 2015 (cit. on pp. 2, 3, 16)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[23]

Understanding generalizability of diffusion models requires rethinking the hidden gaussian structure

Li, X., Dai, Y. & Qu, Q.Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure2024. arXiv:2410.24060 [cs.LG](cit. on p. 2)

work page arXiv
[24]

& Mézard, M.Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in TrainingarXiv:2505.17638 [cs]

Bonnaire, T., Urfin, R., Biroli, G. & Mézard, M.Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in TrainingarXiv:2505.17638 [cs]. May 2025 (cit. on p. 2)

work page arXiv 2025
[25]

J., Veiga, R

George, A. J., Veiga, R. & Macris, N.Analysis of Diffusion Models for Manifold DataarXiv:2502.04339 [math]. Feb. 2025 (cit. on p. 2). 13

work page arXiv 2025
[26]

& Goldt, S.Generalization Dynamics of Linear Diffusion ModelsMay 2025 (cit

Merger, C. & Goldt, S.Generalization Dynamics of Linear Diffusion ModelsMay 2025 (cit. on p. 2)

work page 2025
[27]

Wang, B. & Pehlevan, C.An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion ModelsinThe Thirty-ninth Annual Conference on Neural Information Processing Systems(2025) (cit. on p. 2)

work page 2025
[28]

& Goldt, S.The dynamics of representation learning in shallow, non-linear autoencoders inInternational Conference on Machine Learning(2022), 18499–18519 (cit

Refinetti, M. & Goldt, S.The dynamics of representation learning in shallow, non-linear autoencoders inInternational Conference on Machine Learning(2022), 18499–18519 (cit. on pp. 2, 5)

work page 2022
[29]

& Zdeborová, L

Cui, H. & Zdeborová, L. High-dimensional asymptotics of denoising autoencoders.Advances in Neural Information Processing Systems36,11850–11890 (2023) (cit. on p. 2)

work page 2023
[30]

& Zdeborova, L.Analysis of Learning a Flow-based Gener- ative Model from Limited Sample ComplexityinThe Twelfth International Conference on Learning Representations(2024) (cit

Cui, H., Krzakala, F., Vanden-Eijnden, E. & Zdeborova, L.Analysis of Learning a Flow-based Gener- ative Model from Limited Sample ComplexityinThe Twelfth International Conference on Learning Representations(2024) (cit. on p. 2)

work page 2024
[31]

Cui, H., Pehlevan, C. & Lu, Y. M.A solvable model of learning generative diffusion: theory and insights inThe Thirty-ninth Annual Conference on Neural Information Processing Systems(2025) (cit. on pp. 2, 5)

work page 2025
[32]

& Jagannath, A

Ben Arous, G., Gheissari, R. & Jagannath, A. Online Stochastic Gradient Descent on Non-Convex Losses from High-Dimensional Inference.J. Mach. Learn. Res.22(2021) (cit. on pp. 3, 5–7, 22)

work page 2021
[33]

Krizhevsky, A.Learning Multiple Layers of Features from Tiny Images2009 (cit. on p. 3)

work page
[34]

& Mézard, M

Biroli, G. & Mézard, M. Generative diffusion in very large dimensions. en.Journal of Statistical Mechanics: Theory and Experiment2023,093402 (Sept. 2023) (cit. on pp. 4, 20)

work page 2023
[35]

Anderson, B. D. Reverse-time diffusion equation models. en.Stochastic Processes and their Applica- tions12,313–326 (May 1982) (cit. on p. 4)

work page 1982
[36]

& Klivans, A.Learning Mixtures of Gaussians Using the DDPM Objective2023

Shah, K., Chen, S. & Klivans, A.Learning Mixtures of Gaussians Using the DDPM Objective2023. arXiv:2307.01178 [cs.DS](cit. on pp. 4, 21)

work page arXiv
[37]

Mendes, V. C.et al. A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization2026. arXiv:2602.10680 [stat.ML] (cit. on p. 5)

work page arXiv
[38]

& Bruna, J.Computational-Statistical Gaps in Gaussian Single- Index Models (Extended Abstract)inProceedings of Thirty Seventh Conference on Learning Theory (eds Agrawal, S

Damian, A., Pillaud-Vivien, L., Lee, J. & Bruna, J.Computational-Statistical Gaps in Gaussian Single- Index Models (Extended Abstract)inProceedings of Thirty Seventh Conference on Learning Theory (eds Agrawal, S. & Roth, A.)247(PMLR, 2024), 1262–1262 (cit. on p. 5)

work page 2024
[39]

& Goldt, S

Székely, E., Bardone, L., Gerace, F. & Goldt, S. Learning from higher-order correlations, efficiently: hypothesis tests, random features, and neural networks. en.Advances in Neural Information Pro- cessing Systems37,78479–78522 (Dec. 2024) (cit. on pp. 7, 21)

work page 2024
[40]

& Ricci-Tersenghi, F

Biroli, G., Cammarota, C. & Ricci-Tersenghi, F. How to iron out rough landscapes and get optimal performances: averaged gradient descent and its application to tensor PCA.Journal of Physics A: Mathematical and Theoretical53,174003 (2020) (cit. on p. 7)

work page 2020
[41]

& Lee, J

Damian, A., Nichani, E., Ge, R. & Lee, J. D.Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index ModelsinThirty-seventh Conference on Neural Information Processing Systems(2023) (cit. on pp. 7, 11)

work page 2023
[42]

& Goldt, S

Ricci, F., Bardone, L. & Goldt, S. Feature learning from non-Gaussian inputs: the case of Independent Component Analysis in high dimensions.arXiv preprint arXiv:2503.23896(2025) (cit. on p. 7)

work page arXiv 2025
[43]

& Broeck, C

Engel, A. & Broeck, C. V. D.Statistical Mechanics of Learning(Cambridge University Press, 2001) (cit. on p. 10)

work page 2001
[44]

M., Yehudai, G

Safran, I. M., Yehudai, G. & Shamir, O.The effects of mild over-parameterization on the optimization landscape of shallow relu neural networksinConference on Learning Theory(2021), 3889–3934 (cit. on p. 11). 14

work page 2021
[45]

& Zdeborová, L

Sarao Mannelli, S., Vanden-Eijnden, E. & Zdeborová, L. Optimization and generalization of shallow neural networks with quadratic activation functions.Advances in Neural Information Processing Systems33,13445–13455 (2020) (cit. on p. 11)

work page 2020
[46]

& Nguyen, P.-M

Mei, S., Montanari, A. & Nguyen, P.-M. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences115,E7665–E7671. eprint: https : //www.pnas.org/doi/pdf/10.1073/pnas.1806579115(2018) (cit. on p. 11)

work page doi:10.1073/pnas.1806579115(2018 2018
[47]

& Tang, X.Deep Learning Face Attributes in the WildEnglish

Liu, Z., Luo, P., Wang, X. & Tang, X.Deep Learning Face Attributes in the WildEnglish. inProceedings of the IEEE International Conference on Computer Vision (ICCV)ISSN: 2380-7504 (IEEE Computer Society, Dec. 2015), 3730–3738 (cit. on p. 16)

work page 2015
[48]

McCullagh, P.Tensor methods in statistics(Courier Dover Publications, 2018) (cit. on p. 19)

work page 2018
[49]

Szegő, G.Orthogonal Polynomials(American mathematical society, 1939) (cit. on p. 19)

work page 1939
[50]

& Stegun, I

Abramowitz, M. & Stegun, I. A.Handbook of mathematical functions with formulas, graphs, and mathematical tablesxiv+1046 (National Bureau of Standards, 1964) (cit. on p. 19)

work page 1964
[51]

mean" clone from a Gaussian distribution with mean µ and identity covariance. We then sample the

Bandeira, A. S., Kunisky, D. & Wein, A. S.Computational Hardness of Certifying Bounds on Con- strained PCA Problemsin11th Innovations in Theoretical Computer Science Conference (ITCS 2020) (ed Vidick, T.)151(Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2020), 78:1–78:29 (cit. on pp. 19, 20). 15 a) mean b) mean + cov. c) test Figure A.1: Samples from ...

work page 2020

[1] [1]

& Ganguli, S.Deep Unsupervised Learning using Nonequilibrium ThermodynamicsinProceedings of the 32nd International Conference on Machine Learning(eds Bach, F

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S.Deep Unsupervised Learning using Nonequilibrium ThermodynamicsinProceedings of the 32nd International Conference on Machine Learning(eds Bach, F. & Blei, D.)37(PMLR, Lille, France, 2015), 2256–2265 (cit. on p. 1)

work page 2015

[2] [2]

Denoising Diffusion Probabilistic Models

Ho, J., Jain, A. & Abbeel, P.Denoising Diffusion Probabilistic ModelsarXiv:2006.11239 [cs, stat]. Dec. 2020 (cit. on pp. 1, 3)

work page internal anchor Pith review Pith/arXiv arXiv 2006

[3] [3]

& Ermon, S.Generative Modeling by Estimating Gradients of the Data Distributionin Advances in Neural Information Processing Systems32(Curran Associates, Inc., 2019) (cit

Song, Y. & Ermon, S.Generative Modeling by Estimating Gradients of the Data Distributionin Advances in Neural Information Processing Systems32(Curran Associates, Inc., 2019) (cit. on pp. 1, 3)

work page 2019

[4] [4]

& Solla, S

Saad, D. & Solla, S. Exact Solution for On-Line Learning in Multilayer Neural Networks.Phys. Rev. Lett.74,4337–4340 (1995) (cit. on p. 1)

work page 1995

[5] [5]

M., McClelland, J

Saxe, A. M., McClelland, J. L. & Ganguli, S.Exact solutions to the nonlinear dynamics of learning in deep linear neural networksinICLR(2014) (cit. on p. 1)

work page 2014

[6] [6]

M., McClelland, J

Saxe, A. M., McClelland, J. L. & Ganguli, S. A mathematical theory of semantic development in deep neural networks.Proceedings of the National Academy of Sciences116,11537–11546 (2019) (cit. on p. 1). 12

work page 2019

[7] [7]

Abbe, E., Adsera, E. B. & Misiakiewicz, T.Sgd learning on neural networks: leap complexity and saddle-to-saddle dynamicsinThe Thirty Sixth Annual Conference on Learning Theory(2023), 2552– 2623 (cit. on p. 1)

work page 2023

[8] [8]

& Stephan, L

Dandi, Y., Krzakala, F., Loureiro, B., Pesce, L. & Stephan, L. How Two-Layer Neural Networks Learn, One (Giant) Step at a Time.Journal of Machine Learning Research25,1–65 (2024) (cit. on pp. 1, 5)

work page 2024

[9] [9]

& Zhou, K

Berthier, R., Montanari, A. & Zhou, K. Learning time-scales in two-layers neural networks.Founda- tions of Computational Mathematics25,1627–1710 (2025) (cit. on p. 1)

work page 2025

[10] [10]

& Mondelli, M.Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and DepthinForty-first International Conference on Machine Learning(2024) (cit

Kögler, K., Shevchenko, A., Hassani, H. & Mondelli, M.Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and DepthinForty-first International Conference on Machine Learning(2024) (cit. on p. 1)

work page 2024

[11] [11]

Kalimeris, D.et al. SGD on Neural Networks Learns Functions of Increasing ComplexityinAdvances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada(eds Wallach, H. M.et al.) (2019), 3491–3501 (cit. on p. 1)

work page 2019

[12] [12]

& Tse, D.A Spectral Approach to Generalization and Optimization in Neural NetworksinICLR(2018) (cit

Farnia, F., Zhang, J. & Tse, D.A Spectral Approach to Generalization and Optimization in Neural NetworksinICLR(2018) (cit. on p. 1)

work page 2018

[13] [13]

On the Spectral Bias of Neural NetworksinProc

Rahaman, N.et al. On the Spectral Bias of Neural NetworksinProc. of ICML(eds Chaudhuri, K. & Salakhutdinov, R.)97(PMLR, 2019), 5301–5310 (cit. on p. 1)

work page 2019

[14] [14]

& Goldt, S

Ingrosso, A. & Goldt, S. Data-driven emergence of convolutional structure in neural networks. Proceedings of the National Academy of Sciences119,e2201854119 (2022) (cit. on p. 1)

work page 2022

[15] [15]

2023) (cit

Merger, C.et al.Learning Interacting Theories from Data.Physical Review X13.Publisher: American Physical Society, 041033 (Nov. 2023) (cit. on p. 1)

work page 2023

[16] [16]

& Goldt, S.Neural networks trained with SGD learn distributions of increasing complexityinInternational Conference on Machine Learning(2023), 28843–28863 (cit

Refinetti, M., Ingrosso, A. & Goldt, S.Neural networks trained with SGD learn distributions of increasing complexityinInternational Conference on Machine Learning(2023), 28843–28863 (cit. on pp. 1, 3)

work page 2023

[17] [17]

Bardone, L. & Goldt, S.Sliding Down the Stairs: How Correlated Latent Variables Accelerate Learning with Neural NetworksinProceedings of the 41st International Conference on Machine Learning235 (PMLR, 2024), 3024–3045 (cit. on pp. 1, 4, 5, 8, 11, 23)

work page 2024

[18] [18]

& Goldt, S

Rende, R., Gerace, F., Laio, A. & Goldt, S. A distributional simplicity bias in the learning dynamics of transformers.Advances in Neural Information Processing Systems37,96207–96228 (2024) (cit. on p. 1)

work page 2024

[19] [19]

& Fern, X

Belrose, N., Pope, Q., Quirke, L., Mallen, A. & Fern, X. Neural Networks Learn Statistics of Increasing Complexity.arXiv preprint arXiv:2402.04362(2024) (cit. on p. 1)

work page arXiv 2024

[20] [20]

& Wyart, M.How compositional generalization and creativity improve as diffusion models are traineden

Favero, A., Sclocchi, A., Cagnetta, F., Frossard, P. & Wyart, M.How compositional generalization and creativity improve as diffusion models are traineden. arXiv:2502.12089 [stat]. Mar. 2025 (cit. on p. 1)

work page arXiv 2025

[21] [21]

& Saglietti, L.How Transformers Learn Structured Data: Insights From Hierarchical FilteringinForty-second International Conference on Machine Learning (2025) (cit

Garnier-Brun, J., Mézard, M., Moscato, E. & Saglietti, L.How Transformers Learn Structured Data: Insights From Hierarchical FilteringinForty-second International Conference on Machine Learning (2025) (cit. on p. 1)

work page 2025

[22] [22]

U-Net: Convolutional Networks for Biomedical Image Segmentation

Ronneberger, O., Fischer, P. & Brox, T.U-Net: Convolutional Networks for Biomedical Image Seg- mentationarXiv:1505.04597 [cs]. May 2015 (cit. on pp. 2, 3, 16)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[23] [23]

Understanding generalizability of diffusion models requires rethinking the hidden gaussian structure

Li, X., Dai, Y. & Qu, Q.Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure2024. arXiv:2410.24060 [cs.LG](cit. on p. 2)

work page arXiv

[24] [24]

& Mézard, M.Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in TrainingarXiv:2505.17638 [cs]

Bonnaire, T., Urfin, R., Biroli, G. & Mézard, M.Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in TrainingarXiv:2505.17638 [cs]. May 2025 (cit. on p. 2)

work page arXiv 2025

[25] [25]

J., Veiga, R

George, A. J., Veiga, R. & Macris, N.Analysis of Diffusion Models for Manifold DataarXiv:2502.04339 [math]. Feb. 2025 (cit. on p. 2). 13

work page arXiv 2025

[26] [26]

& Goldt, S.Generalization Dynamics of Linear Diffusion ModelsMay 2025 (cit

Merger, C. & Goldt, S.Generalization Dynamics of Linear Diffusion ModelsMay 2025 (cit. on p. 2)

work page 2025

[27] [27]

Wang, B. & Pehlevan, C.An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion ModelsinThe Thirty-ninth Annual Conference on Neural Information Processing Systems(2025) (cit. on p. 2)

work page 2025

[28] [28]

& Goldt, S.The dynamics of representation learning in shallow, non-linear autoencoders inInternational Conference on Machine Learning(2022), 18499–18519 (cit

Refinetti, M. & Goldt, S.The dynamics of representation learning in shallow, non-linear autoencoders inInternational Conference on Machine Learning(2022), 18499–18519 (cit. on pp. 2, 5)

work page 2022

[29] [29]

& Zdeborová, L

Cui, H. & Zdeborová, L. High-dimensional asymptotics of denoising autoencoders.Advances in Neural Information Processing Systems36,11850–11890 (2023) (cit. on p. 2)

work page 2023

[30] [30]

& Zdeborova, L.Analysis of Learning a Flow-based Gener- ative Model from Limited Sample ComplexityinThe Twelfth International Conference on Learning Representations(2024) (cit

Cui, H., Krzakala, F., Vanden-Eijnden, E. & Zdeborova, L.Analysis of Learning a Flow-based Gener- ative Model from Limited Sample ComplexityinThe Twelfth International Conference on Learning Representations(2024) (cit. on p. 2)

work page 2024

[31] [31]

Cui, H., Pehlevan, C. & Lu, Y. M.A solvable model of learning generative diffusion: theory and insights inThe Thirty-ninth Annual Conference on Neural Information Processing Systems(2025) (cit. on pp. 2, 5)

work page 2025

[32] [32]

& Jagannath, A

Ben Arous, G., Gheissari, R. & Jagannath, A. Online Stochastic Gradient Descent on Non-Convex Losses from High-Dimensional Inference.J. Mach. Learn. Res.22(2021) (cit. on pp. 3, 5–7, 22)

work page 2021

[33] [33]

Krizhevsky, A.Learning Multiple Layers of Features from Tiny Images2009 (cit. on p. 3)

work page

[34] [34]

& Mézard, M

Biroli, G. & Mézard, M. Generative diffusion in very large dimensions. en.Journal of Statistical Mechanics: Theory and Experiment2023,093402 (Sept. 2023) (cit. on pp. 4, 20)

work page 2023

[35] [35]

Anderson, B. D. Reverse-time diffusion equation models. en.Stochastic Processes and their Applica- tions12,313–326 (May 1982) (cit. on p. 4)

work page 1982

[36] [36]

& Klivans, A.Learning Mixtures of Gaussians Using the DDPM Objective2023

Shah, K., Chen, S. & Klivans, A.Learning Mixtures of Gaussians Using the DDPM Objective2023. arXiv:2307.01178 [cs.DS](cit. on pp. 4, 21)

work page arXiv

[37] [37]

Mendes, V. C.et al. A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization2026. arXiv:2602.10680 [stat.ML] (cit. on p. 5)

work page arXiv

[38] [38]

& Bruna, J.Computational-Statistical Gaps in Gaussian Single- Index Models (Extended Abstract)inProceedings of Thirty Seventh Conference on Learning Theory (eds Agrawal, S

Damian, A., Pillaud-Vivien, L., Lee, J. & Bruna, J.Computational-Statistical Gaps in Gaussian Single- Index Models (Extended Abstract)inProceedings of Thirty Seventh Conference on Learning Theory (eds Agrawal, S. & Roth, A.)247(PMLR, 2024), 1262–1262 (cit. on p. 5)

work page 2024

[39] [39]

& Goldt, S

Székely, E., Bardone, L., Gerace, F. & Goldt, S. Learning from higher-order correlations, efficiently: hypothesis tests, random features, and neural networks. en.Advances in Neural Information Pro- cessing Systems37,78479–78522 (Dec. 2024) (cit. on pp. 7, 21)

work page 2024

[40] [40]

& Ricci-Tersenghi, F

Biroli, G., Cammarota, C. & Ricci-Tersenghi, F. How to iron out rough landscapes and get optimal performances: averaged gradient descent and its application to tensor PCA.Journal of Physics A: Mathematical and Theoretical53,174003 (2020) (cit. on p. 7)

work page 2020

[41] [41]

& Lee, J

Damian, A., Nichani, E., Ge, R. & Lee, J. D.Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index ModelsinThirty-seventh Conference on Neural Information Processing Systems(2023) (cit. on pp. 7, 11)

work page 2023

[42] [42]

& Goldt, S

Ricci, F., Bardone, L. & Goldt, S. Feature learning from non-Gaussian inputs: the case of Independent Component Analysis in high dimensions.arXiv preprint arXiv:2503.23896(2025) (cit. on p. 7)

work page arXiv 2025

[43] [43]

& Broeck, C

Engel, A. & Broeck, C. V. D.Statistical Mechanics of Learning(Cambridge University Press, 2001) (cit. on p. 10)

work page 2001

[44] [44]

M., Yehudai, G

Safran, I. M., Yehudai, G. & Shamir, O.The effects of mild over-parameterization on the optimization landscape of shallow relu neural networksinConference on Learning Theory(2021), 3889–3934 (cit. on p. 11). 14

work page 2021

[45] [45]

& Zdeborová, L

Sarao Mannelli, S., Vanden-Eijnden, E. & Zdeborová, L. Optimization and generalization of shallow neural networks with quadratic activation functions.Advances in Neural Information Processing Systems33,13445–13455 (2020) (cit. on p. 11)

work page 2020

[46] [46]

& Nguyen, P.-M

Mei, S., Montanari, A. & Nguyen, P.-M. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences115,E7665–E7671. eprint: https : //www.pnas.org/doi/pdf/10.1073/pnas.1806579115(2018) (cit. on p. 11)

work page doi:10.1073/pnas.1806579115(2018 2018

[47] [47]

& Tang, X.Deep Learning Face Attributes in the WildEnglish

Liu, Z., Luo, P., Wang, X. & Tang, X.Deep Learning Face Attributes in the WildEnglish. inProceedings of the IEEE International Conference on Computer Vision (ICCV)ISSN: 2380-7504 (IEEE Computer Society, Dec. 2015), 3730–3738 (cit. on p. 16)

work page 2015

[48] [48]

McCullagh, P.Tensor methods in statistics(Courier Dover Publications, 2018) (cit. on p. 19)

work page 2018

[49] [49]

Szegő, G.Orthogonal Polynomials(American mathematical society, 1939) (cit. on p. 19)

work page 1939

[50] [50]

& Stegun, I

Abramowitz, M. & Stegun, I. A.Handbook of mathematical functions with formulas, graphs, and mathematical tablesxiv+1046 (National Bureau of Standards, 1964) (cit. on p. 19)

work page 1964

[51] [51]

mean" clone from a Gaussian distribution with mean µ and identity covariance. We then sample the

Bandeira, A. S., Kunisky, D. & Wein, A. S.Computational Hardness of Certifying Bounds on Con- strained PCA Problemsin11th Innovations in Theoretical Computer Science Conference (ITCS 2020) (ed Vidick, T.)151(Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2020), 78:1–78:29 (cit. on pp. 19, 20). 15 a) mean b) mean + cov. c) test Figure A.1: Samples from ...

work page 2020