Pith · machine review for the scientific record

arXiv: 2605.06367 · v1 · submitted 2026-05-07 · 📊 stat.ML · cond-mat.dis-nn · cs.LG

Recognition: unknown

The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:55 UTC · model grok-4.3

classification 📊 stat.ML · cond-mat.dis-nn · cs.LG
keywords diffusion models · class imbalance · learning dynamics · Gaussian mixtures · score-based models · memorization · generalization · data heterogeneity

The pith

Class variance sets the primary learning order in diffusion models, with higher-variance classes learned first.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in score-based diffusion models trained on heterogeneous data, the variance of each class is the main factor deciding when that class is learned during training, with higher-variance classes consistently reaching generalization and memorization earlier. Centroid geometry exerts a weaker influence on this order. Sampling imbalance functions as a modulator that can invert the variance-driven sequence and, when severe, imposes separate later times at which minority classes acquire their distinct features in the reverse diffusion process. This matters because real datasets combine structural differences with unequal class sizes, raising the possibility that models fully memorize some classes while leaving others underlearned.

Core claim

Analyzing a random-features model trained on Gaussian mixtures, the authors derive the feature-covariance spectrum to characterize per-class generalization and memorization times. They show that class variance is the primary determinant of the learning hierarchy, consistently favoring higher-variance classes, while centroid geometry plays a secondary role. Sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, forces minority classes to acquire distinct, delayed speciation times during backward diffusion.
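The analyzed setting can be made concrete with a small numerical sketch: a two-class Gaussian mixture in which per-class variances and sampling fractions are independent knobs. The helper name, the centroid layout, and the specific values (variances 0.5/0.25 and fractions 0.7/0.3, echoing the b1 = 0.7, b2 = 0.3 example in Figure 1) are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, dim, variances=(0.5, 0.25), fractions=(0.7, 0.3),
                   centroid_norm=1.0, rng=rng):
    """Draw n points from a two-class Gaussian mixture with per-class
    variances sigma_c^2 and sampling fractions b_c (illustrative values)."""
    # Two opposed centroids of equal norm along the first coordinate.
    m = np.zeros((2, dim))
    m[0, 0], m[1, 0] = centroid_norm, -centroid_norm
    labels = rng.choice(2, size=n, p=fractions)
    noise = rng.standard_normal((n, dim))
    x = m[labels] + noise * np.sqrt(np.asarray(variances)[labels])[:, None]
    return x, labels

X, y = sample_mixture(5000, dim=50)
# Class 0 is both the majority and the higher-variance class in this setup.
print((y == 0).mean())                    # close to 0.7
print(X[y == 0].var(), X[y == 1].var())   # roughly 0.5 vs 0.25 (plus a small centroid term)
```

Varying `variances` and `fractions` independently is exactly the kind of controlled sweep the theory addresses.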

What carries the argument

The feature-covariance spectrum of the random-features model on Gaussian mixtures, which directly determines the per-class times for generalization and memorization.
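A back-of-the-envelope illustration of that object, assuming a tanh random-features map and zero-mean Gaussian classes (the activation, sizes, and variance values are our choices; the paper derives the spectrum analytically rather than estimating it empirically):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_features, n_samples = 50, 200, 4000

# Fixed random projection W; features phi(x) = tanh(W x). This is a generic
# random-features model, not necessarily the paper's exact activation.
W = rng.standard_normal((n_features, d)) / np.sqrt(d)

def feature_spectrum(sigma2):
    """Eigenvalues (descending) of the empirical feature covariance
    for one zero-mean Gaussian class of variance sigma2."""
    X = rng.standard_normal((n_samples, d)) * np.sqrt(sigma2)
    Phi = np.tanh(X @ W.T)              # (n_samples, n_features)
    C = Phi.T @ Phi / n_samples         # empirical feature covariance
    return np.linalg.eigvalsh(C)[::-1]

spec_hi = feature_spectrum(0.5)
spec_lo = feature_spectrum(0.25)
# The higher-variance class yields a larger top eigenvalue -- the kind of
# spectral ordering the paper converts into earlier learning times.
print(spec_hi[0] > spec_lo[0])
```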

If this is right

  • Higher-variance classes reach both generalization and memorization earlier than lower-variance classes.
  • Strong sampling imbalance can reverse the variance-based learning order.
  • Minority classes develop distinct, delayed speciation times in the backward diffusion process when imbalance is large.
  • Diffusion models can fully memorize some classes while others remain insufficiently learned.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures could incorporate variance-aware sampling or augmentation to reduce disparities in when classes are learned.
  • The same variance-imbalance interaction may appear in other score-based or flow-matching generative models.
  • On highly imbalanced real-world data such as medical images, minority classes might require explicit regularization to avoid delayed or incomplete learning.

Load-bearing premise

The random-features model on Gaussian mixtures captures the essential per-class learning dynamics of full U-Net diffusion models trained on real heterogeneous image data.

What would settle it

Train a U-Net diffusion model on controlled Gaussian-mixture data with independently varied class variances and sampling rates, then measure whether the observed per-class learning curves follow the exact hierarchy and speciation times predicted by the feature-covariance spectrum.
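Extracting per-class learning times from such an experiment could follow the threshold-crossing recipe the paper uses for memorization (Figure 5 uses a 1/3 threshold); the routine and the synthetic curves below are a sketch of that measurement, not the authors' code.

```python
import numpy as np

def crossing_times(curves, times, threshold=1/3):
    """First time each class's learning/memorization fraction reaches
    `threshold` (np.inf if it never does). Mirrors the 1/3 threshold of
    the paper's Figure 5; the extraction routine itself is our own sketch."""
    out = {}
    for cls, frac in curves.items():
        frac = np.asarray(frac)
        idx = int(np.argmax(frac >= threshold))
        out[cls] = float(times[idx]) if frac[idx] >= threshold else np.inf
    return out

times = np.linspace(0.0, 10.0, 101)
# Synthetic monotone curves standing in for measured per-class fractions;
# the higher-variance class is assumed to rise faster, as the theory predicts.
curves = {"high_var": 1 - np.exp(-0.8 * times),
          "low_var": 1 - np.exp(-0.3 * times)}
t_cross = crossing_times(curves, times)
print(t_cross["high_var"] < t_cross["low_var"])   # higher-variance class crosses first
```

Comparing these crossing times against the spectrum-predicted hierarchy, class by class, is the decisive measurement.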

Figures

Figures reproduced from arXiv: 2605.06367 by Chenxiao Ma, Enrico Ventura, Flavio Nicoletti, Luca Saglietti, Stefano Sarao Mannelli.

Figure 1: Strong imbalance induces separation in speciation times. Root-mean-square (RMS) overlaps between the projected centroids µ_c (Eq. 10) and the top principal components (ψ1, ψ2) of U_gep, defined as RMS_c ≡ [Mean(Σ_{i=1,2} (µ_c · ψ_i)^2 / ‖µ_c‖^2)]^{1/2}, as a function of rescaled diffusion time t̃ ≡ t / log N. The left panel corresponds to weak imbalance (b1 = 0.7, b2 = 0.3, Eq. 12); the remaining panels illustrate ins… (caption truncated at source)
Figure 2: Strong class diversity can reduce the generalization window. Theory parameters: χ_p = 60.0, χ_m = 30.0, b1 = b2 = 0.5, σ1^2 = 0.5, σ2^2 = 0.25, ‖m1‖ = ‖m2‖ = 0, t = 0.001. Finite-size simulations: N = 100, P = 6000, M = 3000, N_epoch = 2 × 10^6, learning rate η = 5 × 10^-5 N/∆t, N_spectrum runs = 10, N_train runs = 20. Training time in (a) is rescaled as τ → τ/η̃, where η̃ = η∆t/N.
Figure 3: Class heterogeneity induces class-specific timescales. Class-conditional test errors (Eq. 14) evaluated over training time across three structural ablations of a perfectly symmetric baseline (σ1^2 = σ2^2 = 0.5, ‖m1‖ = ‖m2‖ = √N, b1 = b2 = 0.5, m1 · m2 = 0). The perturbations are variance disparity (σ1^2 = 0.5, σ2^2 = 0.25; left), centroid-norm disparity (‖m1‖ = √N, ‖m2‖ = (3/2)√N; center), and sa… (caption truncated at source)
Figure 4: Imbalance can reverse the order of learning for classes. In both subfigures, solid lines are theoretical predictions derived in Appendix G; crosses are empirical average gaps from gradient-descent simulations. Parameters: χ_p = 60.0, χ_m = 30.0, t = 0.01, N = 100, N_epoch = 2 × 10^6, learning rate η = 5 × 10^-5, N_train runs = 50.
Figure 5: DDPM memorization gaps on Fashion MNIST. (a) Per-class memorization fractions for Sneaker-Coat and Sneaker-Bag DDPMs. The horizontal dashed line marks the threshold f_c^mem(t) = 1/3 used to define t_c^mem; vertical shaded regions show the resulting Sneaker-partner time gap ∆t_mem(c_p). Within each pair, the higher-variance partner class is memorized before Sneaker. Across pairs, the Bag-Sneaker DDPM exhibits… (caption truncated at source)
Figure 6: DDPM-RF comparison across three descriptors. For each Sneaker-partner pair (c_snk, c_p), the DDPM memorization gap on Fashion MNIST is compared with the corresponding RF memorization gap on Gaussian synthetic data. The RF model uses pair descriptors extracted from the data: partner variance, normalized partner centroid norm, and centroid cosine similarity to Sneaker. The panels plot both gaps as functions of… (caption truncated at source)
Figure 7: Comparison of the GEP derived in Eq. (50) with the spectrum of… (caption truncated at source)
Figure 8: Test of Eqs. (114) for t = 0.01. (remainder of caption is garbled appendix text at source)
Figure 9: Check of the scaling of λ_mem^(2). The eigenvalue λ_mem^(2) scales linearly with t and converges to a constant value for large numbers of hidden neurons and samples; the width ∆λ of the leftmost bulk in Fig. 2a is also shown.
Figure 10: Comparison between the test losses and the score MSEs. Parameters as in Figure 3.
Figure 11: Overview example for the Bag class.
Figure 12 (Parts 1-8): Overview examples for Fashion-MNIST classes.
Figure 13: Memorization fraction curves for all Sneaker-paired Fashion MNIST classes.
Figure 14: Left: normalized variance for each class on two datasets; the annotated numbers report the total class… (caption truncated at source)
Figure 16: Same plot as Figure 6, but using alternative thresholds to compute the DDPM memorization gap.
Figure 17: Same plot as Figure 6, but using alternative RF memorization thresholds. (caption truncated at source)
Original abstract

Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models (and potentially exacerbate disparities) remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models. Analyzing a random-features model trained on Gaussian mixtures, we derive the feature-covariance spectrum to characterize per-class generalization and memorization times. We reveal the explicit hierarchy governing these dynamics: class variance is the primary determinant of learning order, consistently favoring higher-variance classes, while centroid geometry plays a secondary role. Sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, forces minority classes to acquire distinct, delayed speciation times during backward diffusion. Together, these results suggest that diffusion models can memorize some classes while others remain insufficiently learned. We validate our theoretical predictions empirically using U-Net models trained on Fashion MNIST.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops a high-dimensional analytical framework for class-dependent learning in score-based diffusion models by analyzing a random-features model trained on Gaussian mixtures. It derives the feature-covariance spectrum to characterize per-class generalization and memorization times, revealing an explicit hierarchy: class variance is the primary determinant of learning order (favoring higher-variance classes), centroid geometry plays a secondary role, and sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, induce distinct delayed speciation times for minority classes during backward diffusion. These theoretical predictions are checked empirically using U-Net diffusion models trained on Fashion-MNIST.

Significance. If the central hierarchy holds, the work offers a valuable analytical tool for understanding how data structure and imbalance shape generalization-memorization transitions in diffusion models, moving beyond homogeneous-data assumptions in prior theory. The explicit derivation from the feature-covariance spectrum and the use of an analytically tractable proxy model are strengths that enable precise predictions; the Fashion-MNIST validation provides initial empirical grounding. This could inform mitigation of class disparities in trained models.

major comments (3)
  1. [§4.2] The derivation of the feature-covariance spectrum: the hierarchy (variance primary over centroid geometry) follows directly from the spectrum eigenvalues, but the paper provides no explicit high-dimensional asymptotic comparison or dominance proof showing that variance terms overwhelm centroid contributions in all parameter regimes; this is load-bearing for the primary-determinant claim.
  2. [§5.1] The definition of speciation times: these times are extracted from the same per-class covariance spectrum used to order generalization/memorization, creating a risk that the 'delayed speciation' result under imbalance is partly definitional rather than independently predictive; a separate operational definition or a cross-check against the true score-matching loss would strengthen the claim.
  3. [§6] Empirical validation: while qualitative agreement with the predicted ordering is shown on Fashion-MNIST, the experiments report neither quantitative alignment (e.g., correlation between predicted and observed per-class learning times) nor error bars on the U-Net runs, leaving the support for extending the random-features hierarchy to nonlinear U-Nets moderate.
minor comments (2)
  1. [§2] Notation for the backward diffusion process and the precise definition of 'speciation' could be consolidated in one place (currently split across §2 and §5) to improve readability.
  2. The abstract states the hierarchy 'consistently favoring higher-variance classes' but the main text should add a short caveat on the regime where this holds (e.g., when imbalance is not extreme).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and outline our responses below, along with the revisions we plan to implement.

Point-by-point responses
  1. Referee: [§4.2] The derivation of the feature-covariance spectrum: the hierarchy (variance primary over centroid geometry) follows directly from the spectrum eigenvalues, but the paper provides no explicit high-dimensional asymptotic comparison or dominance proof showing that variance terms overwhelm centroid contributions in all parameter regimes; this is load-bearing for the primary-determinant claim.

    Authors: We appreciate the referee highlighting this point. We acknowledge that an explicit high-dimensional asymptotic comparison or dominance proof was not included in the original manuscript. In the revised version, we will add such an analysis, deriving the asymptotic behavior of the spectrum eigenvalues to show that variance terms dominate the centroid contributions in the high-dimensional regime. revision: yes

  2. Referee: [§5.1] The definition of speciation times: these times are extracted from the same per-class covariance spectrum used to order generalization/memorization, creating a risk that the 'delayed speciation' result under imbalance is partly definitional rather than independently predictive; a separate operational definition or a cross-check against the true score-matching loss would strengthen the claim.

    Authors: We thank the referee for this observation. The speciation times are indeed defined via the per-class feature-covariance spectrum to provide an analytical characterization. To address the concern of circularity, we will include in the revision an additional cross-validation: we compute the per-class score-matching loss on held-out data during training and demonstrate that the predicted delayed speciation times align with the empirical loss curves for minority classes under strong imbalance. This provides an independent operational check. revision: yes

  3. Referee: [§6] Empirical validation: while qualitative agreement with the predicted ordering is shown on Fashion-MNIST, the experiments report neither quantitative alignment (e.g., correlation between predicted and observed per-class learning times) nor error bars on the U-Net runs, leaving the support for extending the random-features hierarchy to nonlinear U-Nets moderate.

    Authors: We agree that quantitative metrics would provide stronger support. In the revised version, we will report the correlation coefficients between the theoretically predicted learning times (from the random-features model) and the observed per-class generalization/memorization times in the U-Net experiments. Additionally, we will include error bars from multiple independent runs of the U-Net training to quantify variability. revision: yes
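The promised alignment metric is standard; a sketch of how it could be computed follows (all names and numbers are synthetic placeholders, not results from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def alignment(predicted, observed_runs):
    """Pearson correlation between predicted per-class times and the
    run-averaged observed times, plus per-class std as error bars."""
    observed_runs = np.asarray(observed_runs)   # shape (n_runs, n_classes)
    mean_obs = observed_runs.mean(axis=0)
    err = observed_runs.std(axis=0, ddof=1)     # run-to-run error bars
    r = np.corrcoef(predicted, mean_obs)[0, 1]
    return r, mean_obs, err

# Hypothetical RF-predicted times for four classes, and five fake U-Net
# runs that follow the predicted ordering up to noise.
predicted = np.array([1.0, 1.8, 2.5, 4.0])
runs = predicted + 0.2 * rng.standard_normal((5, 4))
r, mean_obs, err = alignment(predicted, runs)
print(r)   # close to 1 by construction in this synthetic example
```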

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained model analysis

Full rationale

The paper defines a random-features model on Gaussian mixtures, analytically derives the feature-covariance spectrum from that model's covariance structure, and uses the resulting spectrum to characterize per-class generalization and memorization times. This is a direct mathematical consequence of the model definition rather than a post-hoc fit renamed as prediction or a self-referential loop. The claimed hierarchy (variance primary, centroid secondary, imbalance as modulator with delayed minority speciation) follows from the spectrum equations without reducing to the inputs by construction. Empirical validation on U-Net models trained on Fashion MNIST is presented as separate confirmation, not part of the derivation. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are present in the abstract or described chain. The speciation times are defined with respect to the backward diffusion process in the same model but do not create a tautological equivalence to the input assumptions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The analysis rests on the standard high-dimensional random-features approximation and Gaussian-mixture data model; no new entities are postulated and the only free parameters are the class variances and sampling rates supplied by the data-generating process.

free parameters (2)
  • class variances
    Relative variances of the Gaussian components are inputs that directly set the primary learning order; they are not fitted to the target result.
  • sampling rates per class
    Imbalance ratios are chosen as part of the experimental design and modulate the ordering.
axioms (1)
  • domain assumption High-dimensional limit in which the random-features model yields an exact feature-covariance spectrum
    Invoked to obtain closed-form expressions for per-class generalization and memorization times.



Reference graph

Works this paper leans on

63 extracted references · 20 canonical work pages · 5 internal anchors

  1. [1]

    Losing dimensions: Geometric memorization in generative diffusion

    B. Achilli et al. “Losing dimensions: Geometric memorization in generative diffusion”. In:arXiv:2410.08727 (2024)

  2. [2]

    Memorization and generalization in generative diffusion under the manifold hypothesis

    B. Achilli et al. “Memorization and generalization in generative diffusion under the manifold hypothesis”. In: Journal of Statistical Mechanics: Theory and Experiment 2025.7 (2025), p. 073401

  3. [3]

    Theory of speciation transitions in diffusion models with general class structure

    B. Achilli et al. “Theory of speciation transitions in diffusion models with general class structure”. In: Journal of Statistical Mechanics: Theory and Experiment 2026.4 (2026), p. 043304

  4. [4]

    The statistical thermodynamics of generative diffusion models: Phase transitions, symmetry breaking and critical instability

    L. Ambrogioni. “The statistical thermodynamics of generative diffusion models: Phase transitions, symmetry breaking and critical instability”. In:arXiv:2310.17467(2024)

  5. [5]

    Reverse-time diffusion equation models

    B. D. Anderson. “Reverse-time diffusion equation models”. In:Stochastic Processes and their Applications 12.3 (1982), pp. 313–326

  6. [6]

    Generative diffusion in very large dimensions

    G. Biroli and M. Mézard. “Generative diffusion in very large dimensions”. In: Journal of Statistical Mechanics: Theory and Experiment 2023.9 (2023), p. 093402

  7. [7]

    Dynamical regimes of diffusion models

    G. Biroli et al. “Dynamical regimes of diffusion models”. In: Nature Communications 15.1 (2024), p. 9957

  8. [8]

    Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training

    T. Bonnaire et al. “Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training”. In: Advances in Neural Information Processing Systems (2025)

  9. [9]

    Spin glass theory and far beyond: replica symmetry breaking after 40 years

    P. Charbonneau et al. Spin glass theory and far beyond: replica symmetry breaking after 40 years. World Scientific, 2023

  10. [10]

    A solvable model of learning generative diffusion: theory and insights

    H. Cui, C. Pehlevan, and Y. M. Lu. “A solvable model of learning generative diffusion: theory and insights”. In: arXiv preprint arXiv:2501.03937 (2025)

  11. [11]

    Analysis of learning a flow-based generative model from limited sample complexity

    H. Cui et al. “Analysis of learning a flow-based generative model from limited sample complexity”. In:arXiv preprint arXiv:2310.03575(2023)

  12. [12]

    Universality laws for gaussian mixtures in generalized linear models

    Y. Dandi et al. “Universality laws for gaussian mixtures in generalized linear models”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 54754–54768

  13. [13]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    J. Devlin et al. “Bert: Pre-training of deep bidirectional transformers for language understanding”. In: Proceedings of the 2019 conference of the North American association for computational linguistics (2019), pp. 4171–4186

  14. [14]

    The eigenvalue spectrum of a large symmetric random matrix

    S. F. Edwards and R. C. Jones. “The eigenvalue spectrum of a large symmetric random matrix”. In:Journal of Physics A: Mathematical and General9.10 (1976), p. 1595

  15. [15]

    Bigger Isn’t Always Memorizing: Early Stopping Overparameterized Diffusion Models

    A. Favero, A. Sclocchi, and M. Wyart. “Bigger Isn’t Always Memorizing: Early Stopping Overparameterized Diffusion Models”. In:arXiv preprint arXiv:2505.16959(2025)

  16. [16]

    Analysis of diffusion models for manifold data

    A. J. George, R. Veiga, and N. Macris. “Analysis of diffusion models for manifold data”. In: 2025 IEEE International Symposium on Information Theory (ISIT). IEEE. 2025, pp. 1–6

  17. [17]

    Denoising Score Matching with Random Features: Insights on Diffusion Models from Precise Learning Curves

    A. J. George, R. Veiga, and N. Macris. “Denoising Score Matching with Random Features: Insights on Diffusion Models from Precise Learning Curves”. In:Arxiv:2502.00336(2025)

  18. [18]

    Generalisation error in learning with random features and the hidden manifold model

    F. Gerace et al. “Generalisation error in learning with random features and the hidden manifold model”. In: International Conference on Machine Learning. PMLR. 2020, pp. 3452–3462

  19. [19]

    Modeling the influence of data structure on learning in neural networks: The hidden manifold model

    S. Goldt et al. “Modeling the influence of data structure on learning in neural networks: The hidden manifold model”. In: Physical Review X 10.4 (2020), p. 041044

  20. [20]

    On memorization in diffusion models

    X. Gu et al. “On memorization in diffusion models”. In:arXiv preprint arXiv:2310.02664(2023)

  21. [21]

    Deep residual learning for image recognition

    K. He et al. “Deep residual learning for image recognition”. In:Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778

  22. [22]

    Denoising Diffusion Probabilistic Models

    J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion Probabilistic Models. 2020

  23. [23]

    Universal language model fine-tuning for text classification

    J. Howard and S. Ruder. “Universal language model fine-tuning for text classification”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 328–339

  24. [24]

    Bias in motion: Theoretical insights into the dynamics of bias in sgd training

    A. Jain et al. “Bias in motion: Theoretical insights into the dynamics of bias in sgd training”. In:Advances in Neural Information Processing Systems37 (2024), pp. 24435–24471

  25. [25]

    Understanding and mitigating memorization in generative models via sharpness of probability landscapes

    D. Jeon, D. Kim, and A. No. “Understanding and mitigating memorization in generative models via sharpness of probability landscapes”. In:International Conference on Learning Representations(2025)

  26. [26]

    Stage-wise dynamics of classifier-free guidance in diffusion models

    C. Jin, Q. Shi, and Y. Gu. “Stage-wise dynamics of classifier-free guidance in diffusion models”. In: arXiv preprint arXiv:2509.22007 (2025)

  27. [27]

    Generalization in diffusion models arises from geometry-adaptive harmonic representation

    Z. Kadkhodaie et al. “Generalization in diffusion models arises from geometry-adaptive harmonic representation”. In: International Conference on Learning Representations (2024)

  28. [28]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba. “Adam: A method for stochastic optimization”. In:arXiv preprint arXiv:1412.6980 (2014)

  29. [29]

    Learning multiple layers of features from tiny images

    A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Toronto, ON, Canada, 2009

  30. [30]

    A simple weight decay can improve generalization

    A. Krogh and J. Hertz. “A simple weight decay can improve generalization”. In:Advances in neural information processing systems4 (1991)

  31. [31]

    Diffusion models already have a semantic latent space

    M. Kwon, J. Jeong, and Y. Uh. “Diffusion models already have a semantic latent space”. In: arXiv preprint arXiv:2210.10960 (2022)

  32. [32]

    How diffusion models learn to factorize and compose

    Q. Liang et al. “How diffusion models learn to factorize and compose”. In:Advances in Neural Information Processing Systems37 (2024), pp. 15121–15148

  33. [33]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    I. Loshchilov and F. Hutter. “Sgdr: Stochastic gradient descent with warm restarts”. In:arXiv preprint arXiv:1608.03983(2016)

  34. [34]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter. “Decoupled weight decay regularization”. In:arXiv preprint arXiv:1711.05101 (2017)

  35. [35]

    A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization

    V . C. Mendes et al. “A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization”. In:arXiv:2602.10680(2026)

  36. [36]

    Generalization dynamics of linear diffusion models

    C. Merger and S. Goldt. “Generalization dynamics of linear diffusion models”. In:arXiv preprint arXiv:2505.24769(2025)

  37. [37]

    Dynamical decoupling of generalization and overfitting in large two-layer networks,

    A. Montanari and P. Urbani. “Dynamical decoupling of generalization and overfitting in large two-layer networks”. In: arXiv preprint arXiv:2502.21269 (2025)

  38. [38]

    Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering

    W. Peng et al. “Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering”. In:arXiv preprint arXiv:2409.02426(2024)

  39. [39]

    Analyzing bias in diffusion-based face generation models

    M. V . Perera and V . M. Patel. “Analyzing bias in diffusion-based face generation models”. In:2023 IEEE International Joint Conference on Biometrics (IJCB). IEEE. 2023, pp. 1–10

  40. [40]

    Class Imbalance in Anomaly Detection: Learning from an Exactly Solvable Model

    F. S. Pezzicoli et al. “Class Imbalance in Anomaly Detection: Learning from an Exactly Solvable Model”. In: Proceedings of Machine Learning Research (2025). Ed. by Y. Li et al., pp. 1261–1269. URL: https://proceedings.mlr.press/v258/pezzicoli25a.html

  41. [41]

    A first course in random matrix theory: for physicists, engineers and data scientists

    M. Potters and J.-P. Bouchaud. A first course in random matrix theory: for physicists, engineers and data scientists. Cambridge University Press, 2020

  42. [42]

    Random features for large-scale kernel machines

    A. Rahimi and B. Recht. “Random features for large-scale kernel machines”. In:Advances in neural information processing systems20 (2007)

  43. [43]

    Spontaneous symmetry breaking in generative diffusion models

    G. Raya and L. Ambrogioni. “Spontaneous symmetry breaking in generative diffusion models”. In:Advances in Neural Information Processing Systems(2023)

  44. [44]

    U-net: Convolutional networks for biomedical image segmentation

    O. Ronneberger, P. Fischer, and T. Brox. “U-net: Convolutional networks for biomedical image segmentation”. In:International Conference on Medical image computing and computer-assisted intervention. Springer. 2015, pp. 234–241

  45. [45]

    A Geometric Framework for Understanding Memorization in Generative Models

    B. L. Ross et al. “A Geometric Framework for Understanding Memorization in Generative Models”. In:ICML 2024 Next Generation of AI Safety Workshop. 2024

  46. [46]

    An investigation of why overparameterization exacerbates spurious correlations

    S. Sagawa et al. “An investigation of why overparameterization exacerbates spurious correlations”. In: International Conference on Machine Learning. PMLR. 2020, pp. 8346–8356

  47. [47]

    Bias-inducing geometries: An exactly solvable data model with fairness implications

    S. Sarao Mannelli et al. “Bias-inducing geometries: An exactly solvable data model with fairness implications”. In:Physical Review E112.2 (2025), p. 025304

  48. [48]

    Dissecting and mitigating diffusion bias via mechanistic interpretability

    Y. Shi et al. “Dissecting and mitigating diffusion bias via mechanistic interpretability”. In: Proceedings of the Computer Vision and Pattern Recognition Conference. 2025, pp. 8192–8202

  49. [49]

    Maximum likelihood training of score-based diffusion models

    Y. Song et al. “Maximum likelihood training of score-based diffusion models”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 1415–1428

  50. [50]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Y. Song et al. “Score-Based Generative Modeling through Stochastic Differential Equations”. In: International Conference on Learning Representations (2021)

  51. [51]

    Dropout: a simple way to prevent neural networks from overfitting

    N. Srivastava et al. “Dropout: a simple way to prevent neural networks from overfitting”. In:The journal of machine learning research15.1 (2014), pp. 1929–1958

  52. [52]

    Your diffusion model secretly knows the dimension of the data manifold

    J. Stanczuk et al. “Your diffusion model secretly knows the dimension of the data manifold”. In: arXiv:2207.09786(2023)

  53. [53]

    Rethinking the inception architecture for computer vision

    C. Szegedy et al. “Rethinking the inception architecture for computer vision”. In:Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 2818–2826

  54. [54]

    Attention is all you need

    A. Vaswani et al. “Attention is all you need”. In: Advances in Neural Information Processing Systems 30 (2017)

  55. [55]

    Manifolds, Random Matrices and Spectral Gaps: The geometric phases of generative diffusion

    E. Ventura et al. “Manifolds, Random Matrices and Spectral Gaps: The geometric phases of generative diffusion”. In: International Conference on Learning Representations (2025)

  56. [56]

    Exploring bias in over 100 text-to-image generative models

    J. Vice et al. “Exploring bias in over 100 text-to-image generative models”. In:arXiv preprint arXiv:2503.08012 (2025)

  57. [57]

    An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models

    B. Wang and C. Pehlevan. “An analytical theory of spectral bias in the learning dynamics of diffusion models”. In:arXiv preprint arXiv:2503.03206(2025)

  58. [58]

    The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications

    B. Wang and J. J. Vastola. “The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications”. In:Transactions on Machine Learning Research(2024)

  59. [59]

    The Diffusion Process as a Correlation Machine: Linear Denoising Insights

    D. Weitzner et al. “The Diffusion Process as a Correlation Machine: Linear Denoising Insights”. In: Transactions on Machine Learning Research (2025)

  60. [60]

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    H. Xiao, K. Rasul, and R. Vollgraf. “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms”. In: arXiv preprint arXiv:1708.07747 (2017)

  61. [61]

    Diffusion probabilistic models generalize when they fail to memorize

    T. Yoon et al. “Diffusion probabilistic models generalize when they fail to memorize”. In: ICML 2023 workshop on structured probabilistic inference & generative modeling. 2023
