pith. sign in

arxiv: 2503.03206 · v3 · submitted 2025-03-05 · 💻 cs.LG · cs.CV· math.ST· stat.ML· stat.TH

An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models

Pith reviewed 2026-05-23 01:38 UTC · model grok-4.3

classification 💻 cs.LG cs.CVmath.STstat.MLstat.TH
keywords diffusion modelsspectral biaslearning dynamicsGaussian equivalencedenoising networksprobability flowconvolutional biasmode emergence order
0
0 comments X

The pith

Diffusion models learn high-variance modes orders of magnitude faster than low-variance ones because matching time scales as the inverse of mode variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper solves the full-batch gradient-flow dynamics of linear and convolutional denoisers after replacing the data distribution with a matching Gaussian, then integrates the probability-flow ODE to obtain closed-form expressions for the evolving generated distribution. This produces a universal inverse-variance spectral law in which the time for any eigen- or Fourier mode to reach its target variance is proportional to one over that variance. A sympathetic reader cares because the law directly predicts that global, high-variance structure appears long before fine detail, and because the same ordering survives in deep MLPs while local convolution qualitatively changes it. Experiments on both Gaussian and natural-image data confirm the predicted ordering under standard architectures.

Core claim

Under the solved dynamics the time constant for an eigenmode or Fourier mode with variance λ to match its target variance is τ ∝ λ^{-1}, so high-variance (coarse) structure is mastered orders of magnitude sooner than low-variance (fine) detail; weight sharing only multiplies all rates while local convolution produces near-simultaneous mode emergence.

What carries the argument

The inverse-variance spectral law obtained by solving the gradient-flow ODE for the denoiser under Gaussian equivalence and integrating the resulting probability-flow ODE.

If this is right

  • Weight sharing in deep linear networks and circulant convolutions multiplies every learning rate by the same factor and therefore preserves the inverse-variance ordering.
  • Local convolution produces a qualitatively different bias in which many modes reach their targets nearly simultaneously.
  • Data covariance alone determines the order and relative speed of structure acquisition in MLP-based diffusion models.
  • The spectral law continues to hold in deep U-Nets trained on both synthetic Gaussian data and natural images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Changing the covariance spectrum of the training set should directly reorder which image features appear first during training.
  • Architectures that avoid local convolution may be needed if an application requires fine detail to emerge early rather than late.
  • The same inverse-variance mechanism could be tested in other score-based or denoising generative models whose training dynamics admit a similar linearization.

Load-bearing premise

The data distribution can be replaced by a matching Gaussian for the purpose of solving the gradient-flow dynamics without altering the predicted mode-matching times.

What would settle it

Train a diffusion model on data whose covariance spectrum is known, record the training step at which each Fourier or eigenmode first reaches a fixed fraction of its target variance, and test whether those steps scale linearly with the reciprocal of mode variance.

Figures

Figures reproduced from arXiv: 2503.03206 by Binxu Wang, Cengiz Pehlevan.

Figure 1
Figure 1. Figure 1: Spectral-bias schematic. Learning and sam￾pling together impose a variance-ordered bias along covariance eigenmodes. Diffusion models create rich data by gradually transforming Gaussian noise into signal, a paradigm that now drives state-of-the-art generation in vision, audio, and molecular design [1, 2, 3]. Yet two basic questions remain open. (i) Which parts of the data distribution do these models learn… view at source ↗
Figure 2
Figure 2. Figure 2: Learning dynamics per eigenmode. Top: one-layer linear denoiser. Bottom: two-layer symmetric denoiser. (A,D) Weight trajectories u ⊤ kWσ(τ )uk (σ= 1). (B,E) Generated-variance λ˜ k versus target variance λk. (C,F) Power-law relation between emergence time τ ∗ k and λk. Interpretation. Each eigenmode projec￾tion of the weight Wσuk converges to the optimal value W∗ σuk exponentially with rate (σ 2 + λk); hen… view at source ↗
Figure 3
Figure 3. Figure 3: Spectral Learning Dynamics of MLP-UNet (FFHQ32). A. Generated samples during training. B. Evolution of sample variance λ˜ k(τ ) across eigenmodes during training. C. Heatmap of variance trajectories along all eigenmodes, with dots marking mode emergence times τ ∗ (first-passage time at the geometric mean of initial and final variances). The gray zone (0.5–2× target variance) indicates modes starting too cl… view at source ↗
Figure 4
Figure 4. Figure 4: Learning dynamics of UNet differs | FFHQ32. A. Sample trajectory from CNN-UNet. B. Variance evolution along covariance eigenmodes. (c.f. Fig. 3A.C.) Experiment 2: Natural Image Datasets x MLP Next, we validated our theory on natural image datasets. We flattened the images as a vectors, and trained a deeper and wider MLP￾UNet to learn the distribution. Using FFHQ as our running example, moni￾toring the gene… view at source ↗
Figure 5
Figure 5. Figure 5: Learning dynamics of the weight and variance of the generated distribution per eigenmode (continued) Top Single layer linear denoiser. Bottom Symmetric two-layer denoiser. A.C. Learning dynamics of u ⊺ kW(τ )uk. B.D. Learning dynamics of the variance of the generated distribution λ˜ k, as a function of the variance of the target eigenmode λk. This case with larger amplitude weight initialization Qk = 0.5. … view at source ↗
Figure 6
Figure 6. Figure 6: Power law relationship between mode emergence time and target mode vari￾ance for one-layer and two-layer linear denoisers. Panels (A) and (B) respectively plot the Mode variance against the Emergence Step for different values of weight initialization Qk ∈ {0.0, 0.1, 0.5, 0.6, 1.0} (columns), for one layer and two layer linear denoser (rows). We used σ0 = 0.002 and σT = 80. The emergence steps were quantifi… view at source ↗
Figure 7
Figure 7. Figure 7: Learning dynamics of the weight and variance of the generated distribution per eigenmode, for one layer linear flow matching model Similar plotting format as [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Power law relationship between mode emergence time and target mode variance for one-layer linear flow matching. Panels (A) and (B) respectively plot the Mode variance against the Emergence Step for different values of weight initialization Qk ∈ {0.0, 0.1, 0.5, 0.6, 1.0} (columns), for one layer linear flow model. The emergence steps were quantified via different criterions, via harmonic mean in A, and geom… view at source ↗
Figure 9
Figure 9. Figure 9: Spectral Learning Dynamics of MLP-UNet (Gaussian-rotated). (same layout and analysis procedure as main [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Dynamical alignment onto the covariance eigenframe of data (MLP-UNet, FFHQ32, AFHQ32). Alignment score χ as function of training step. Alignment score defined as the sum of square of diagonal entries of the rotated sample covariance on the training data eigenframe U T Σ˜ τU, divided by the sum of square of all entries. This quantifies how well the training data eigenframe diagonalizes the generated sample… view at source ↗
Figure 11
Figure 11. Figure 11: Spectral Learning Dynamics of MLP-UNet (MNIST, CIFAR10, AFHQ32, FFHQ32- fixword, random word). A. Generated samples during training. B. Evolution of sample variance λ˜ k(τ ) across eigenmodes during training. C. Heatmap of variance trajectories along all eigenmodes, with dots marking mode emergence times τ ∗ (first-passage time at the geometric mean of initial and final variances). The gray zone (0.5–2× t… view at source ↗
Figure 12
Figure 12. Figure 12: Spectral Learning Dynamics of CNN-UNet (FFHQ32). (same layout and analysis procedure as main [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Dynamical alignment onto the covariance eigenframe of data (CNN-UNet, FFHQ32). Alignment score r as function of training step. Same analysis as Fig.10. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: UNet denoiser can be approximated by linear convolution early in training (CNN￾UNet, FFHQ32). A. Early in training, the UNet denoiser output can be well approximated by a linear convolutional layer, with a patch size P. B. The approximation error as a function of patch size P, training time τ and noise scale σ. Generally, early in training, the denoiser is very local and linear, well approximated by a lin… view at source ↗
Figure 15
Figure 15. Figure 15: Visualizing the denoiser training dynamics with a fixed image and noise seed (CNN￾UNet, FFHQ32). D(x + σz, σ) as a function of training time τ and noise scale σ [PITH_FULL_IMAGE:figures/full_fig_p027_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Spectral Bias in Whole Image of CNN learning | MNIST Training dynamics of sample (whole image) variance along eigenbasis of training set, normalized by target variance. Upper 0-100 eigen modes, Lower 0-500 eigenmodes. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Spectral Bias in CNN-Based Diffusion Learning: Variance Dynamics in Image Patches | MNIST (32 pixel resolution). Left, Raw variance of generated patches along true eigenbases during training. Right, Scaling relationship between the target variance of eigenmode versus mode emergence time (harmonic mean criterion). Each row corresponds to a different patch size and stride used for extracting patches from im… view at source ↗
Figure 18
Figure 18. Figure 18: Spectral Bias in CNN-Based Diffusion Learning: Variance Dynamics in Image Patches | CIFAR10 (32 pixel resolution). Left, Raw variance of generated patches along true eigenbases during training. Right, Scaling relationship between the target variance of eigenmode versus mode emergence time (harmonic mean criterion). Each row corresponds to a different patch size and stride used for extracting patches from … view at source ↗
Figure 19
Figure 19. Figure 19: Spectral Bias in CNN-Based Diffusion Learning: Variance Dynamics in Image Patches | FFHQ (64 pixel resolution). Left, Raw variance of generated patches along true eigenbases during training. Right, Scaling relationship between the target variance of eigenmode versus mode emergence time (harmonic mean criterion). Each row corresponds to a different patch size and stride used for extracting patches from ima… view at source ↗
Figure 20
Figure 20. Figure 20: Spectral Bias in CNN-Based Diffusion Learning: Variance Dynamics in Image Patches | AFHQv2 (64 pixel resolution). Left, Raw variance of generated patches along true eigenbases during training. Right, Scaling relationship between the target variance of eigenmode versus mode emergence time (harmonic mean criterion). Each row corresponds to a different patch size and stride used for extracting patches from i… view at source ↗
Figure 21
Figure 21. Figure 21: Interaction of mean and covariance learning. Top solution to the w, b dynamics under different noise level σ ∈ {0.1, 1.5, 4}. Bottom Phase portraits corresponding to the two-d system. (m = 1, λk = 1) with a fixed dynamic matrix M defined by ⊗ Kronecker product. M :=    σ 2 + λ1 + m2 1 m1m2 . m1 m1m2 σ 2 + λ2 + m2 2 . m2 ... ... ... . m1 m2 . 1    ⊗ Id := M˜ ⊗ Id (141) Remark D.3. The dynamics matrix… view at source ↗
Figure 22
Figure 22. Figure 22: Phase diagram for the simplified 2d dynamic system. above d dτ f = Ag−Bg2f, d dτ g = Af − Bf 2 g for A = 1, B = 2 we can see the manifold of stable attractors in 1 and 3rd quadrangle. fg = A/B, and the conserved quantity along the hyperbolic lines. Now the equation for fg becomes closed d dτ (fg) = p 4f 2g 2 + C2(A − Bfg) Let h = fg d dτ h = p 4h 2 + C2(A − Bh) Note that for this self contained equation, … view at source ↗
Figure 23
Figure 23. Figure 23: Example of learning to generate low-dimensional manifold with Song UNet-inspired MLP denoiser. Pstd = 1.2). Specifically, the noise level σ is sampled via σ = exp Pmean + Pstd ϵ  , ϵ ∼ N (0, 1). The weighting function per noise scale is defined as: w(σ) = σ 2 + σ 2 data (σ σdata) 2 , with hyperparameter σdata (e.g., σdata = 0.5). The noisy input y is created by the following, y = x + σn, n ∼ N 0, Id  , … view at source ↗
read the original abstract

We develop an analytical framework for understanding how the generated distribution evolves during diffusion model training. Leveraging a Gaussian-equivalence principle, we solve the full-batch gradient-flow dynamics of linear and convolutional denoisers and integrate the resulting probability-flow ODE, yielding analytic expressions for the generated distribution. The theory exposes a universal inverse-variance spectral law: the time for an eigen- or Fourier mode to match its target variance scales as $\tau\propto\lambda^{-1}$, so high-variance (coarse) structure is mastered orders of magnitude sooner than low-variance (fine) detail. Extending the analysis to deep linear networks and circulant full-width convolutions shows that weight sharing merely multiplies learning rates -- accelerating but not eliminating the bias -- whereas local convolution introduces a qualitatively different bias. Experiments on Gaussian and natural-image datasets confirm the spectral law persists in deep MLP-based UNet. Convolutional U-Nets, however, display rapid near-simultaneous emergence of many modes, implicating local convolution in reshaping learning dynamics. These results underscore how data covariance governs the order and speed with which diffusion models learn, and they call for deeper investigation of the unique inductive biases introduced by local convolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper develops an analytical framework for diffusion model training dynamics by invoking a Gaussian-equivalence principle to solve the full-batch gradient-flow ODEs for linear and convolutional denoisers, then integrating the probability-flow ODE to obtain closed-form expressions for the generated distribution. It derives a universal inverse-variance spectral law stating that the time for an eigen- or Fourier mode to match its target variance scales as τ ∝ λ^{-1}, so that high-variance coarse structure emerges orders of magnitude earlier than low-variance fine detail. The analysis is extended to deep linear networks (where weight sharing multiplies effective learning rates) and to circulant full-width convolutions, while experiments on synthetic Gaussians and natural images confirm the ordering for MLP-based UNets but show near-simultaneous mode emergence for convolutional UNets, implicating local convolution as a qualitatively different inductive bias.

Significance. If the central derivations hold, the work supplies a first-principles account of spectral bias in diffusion models that is grounded in data covariance rather than architecture-specific heuristics. The observation that the Gaussian-equivalence step is exact for linear denoisers (because squared-error loss depends only on second moments) removes a common source of approximation error and strengthens the theoretical claim; the experimental reproduction of the same ordering outside the linear regime further supports generality. The result directly explains why coarse structure appears early and offers a concrete scaling relation that can be tested or exploited in model design.

minor comments (2)
  1. The abstract states that local convolution 'introduces a qualitatively different bias' but does not indicate whether this is shown analytically or only observed experimentally; a single sentence clarifying the status of that claim would improve precision.
  2. Notation for the time variable τ and eigenvalue λ is introduced in the abstract without an immediate forward reference to the defining ODE; adding a brief parenthetical definition on first use would aid readers who begin with the abstract.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and supportive review of our manuscript. The assessment correctly captures the core contributions, including the Gaussian-equivalence principle for linear denoisers, the derivation of the inverse-variance spectral law, and the distinction between weight-sharing and local-convolution effects. As no specific major comments were raised, we have no point-by-point rebuttals to offer.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained first-principles ODE solution

full rationale

The central result τ∝λ^{-1} is obtained by analytically solving the gradient-flow ODE for linear denoisers after invoking Gaussian equivalence. For linear models the squared-error loss depends solely on second moments, so the equivalence is exact rather than approximate and the scaling follows directly from the linear dynamics without any fitted parameter being relabeled as a prediction. No self-citation is load-bearing for the uniqueness or form of the result, no ansatz is smuggled in, and the derivation does not reduce to its own inputs by construction. Experiments on natural images supply external confirmation outside the linear regime.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the Gaussian-equivalence principle as the key modeling step that closes the dynamics; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Gaussian-equivalence principle that allows replacement of the data distribution by a Gaussian with matching covariance for the purpose of solving the gradient-flow equations.
    Invoked to obtain solvable linear dynamics for each eigenmode.

pith-pipeline@v0.9.0 · 5748 in / 1342 out tokens · 25162 ms · 2026-05-23T01:38:27.906275+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models

    stat.ML 2026-05 unverdicted novelty 7.0

    Higher-variance classes are learned first in diffusion models; strong class imbalance reverses the order and imposes distinct delayed learning times on minority classes.

  2. Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine

    cs.LG 2026-05 unverdicted novelty 6.0

    SiLD is a score-matching framework that learns both manifold projection and intrinsic density from a single objective, with proven sample complexity depending only on intrinsic dimension.

  3. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  4. What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 4 Pith papers · 4 internal anchors

  1. [1]

    Weiss, Niru Maheswaranathan, and Surya Ganguli

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015

  2. [2]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  3. [3]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  4. [4]

    Statistics of natural image categories

    Antonio Torralba and Aude Oliva. Statistics of natural image categories. Network: computation in neural systems, 14(3):391, 2003

  5. [5]

    Temporal low-order statistics of natural sounds.Advances in neural information processing systems, 9, 1996

    Hagai Attias and Christoph Schreiner. Temporal low-order statistics of natural sounds.Advances in neural information processing systems, 9, 1996

  6. [6]

    Statistics of natural time-varying images

    Dawei W Dong and Joseph J Atick. Statistics of natural time-varying images. Network: computation in neural systems, 6(3):345, 1995

  7. [7]

    Eigenfaces for recognition.Journal of cognitive neuroscience, 3(1):71–86, 1991

    Matthew Turk and Alex Pentland. Eigenfaces for recognition.Journal of cognitive neuroscience, 3(1):71–86, 1991

  8. [8]

    A geometric analysis of deep generative image models and its applications

    Binxu Wang and Carlos R Ponce. A geometric analysis of deep generative image models and its applications. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=GH7QRzUDdXG

  9. [9]

    Binxu Wang and John J. Vastola. The Hidden Linear Structure in Score-Based Models and its Application. arXiv e-prints, art. arXiv:2311.10892, November 2023. doi: 10.48550/arXiv.2311. 10892

  10. [10]

    Understanding generalizability of diffusion models requires rethinking the hidden gaussian structure

    Xiang Li, Yixiang Dai, and Qing Qu. Understanding generalizability of diffusion models requires rethinking the hidden gaussian structure. arXiv preprint arXiv:2410.24060, 2024

  11. [11]

    The unreasonable effectiveness of gaussian score approxima- tion for diffusion models and its applications

    Binxu Wang and John Vastola. The unreasonable effectiveness of gaussian score approxima- tion for diffusion models and its applications. Transactions on Machine Learning Research, December 2024. arXiv preprint arXiv:2412.09726

  12. [12]

    On the spectral bias of neural networks

    Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International Conference on Machine Learning, pages 5301–5310. PMLR, 2019

  13. [13]

    Spectrum dependent learning curves in kernel regression and wide neural networks

    Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1024–1034. PMLR, 13–18 Jul 2020. URL https:...

  14. [14]

    Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks.Nature Communications, 12(1):2914, May 2021

    Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks.Nature Communications, 12(1):2914, May 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-23103-1. URL https://doi.org/10.1038/s41467-021-23103-1

  15. [15]

    Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

    Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013

  16. [16]

    Implicit bias of gradient descent on linear convolutional networks

    Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. Advances in neural information processing systems, 31, 2018

  17. [17]

    Learning dynamics of linear denoising autoencoders

    Arnu Pretorius, Steve Kroon, and Herman Kamper. Learning dynamics of linear denoising autoencoders. In International Conference on Machine Learning, pages 4141–4150. PMLR, 2018. 10

  18. [18]

    A unifying view on implicit bias in training linear neural networks

    Chulhee Yun, Shankar Krishnan, and Hossein Mobahi. A unifying view on implicit bias in training linear neural networks. arXiv preprint arXiv:2010.02501, 2020

  19. [19]

    Diffusion is spectral autoregression, 2024

    Sander Dieleman. Diffusion is spectral autoregression, 2024. URL https://sander.ai/2024/ 09/02/spectral-autoregression.html

  20. [20]

    Generative modelling with inverse heat dissipation

    Severi Rissanen, Markus Heinonen, and Arno Solin. Generative modelling with inverse heat dissipation. arXiv preprint arXiv:2206.13397, 2022

  21. [21]

    Wavelet score-based generative modeling

    Florentin Guth, Simon Coste, Valentin De Bortoli, and Stephane Mallat. Wavelet score-based generative modeling. Advances in neural information processing systems, 35:478–491, 2022

  22. [22]

    Diffusion models generate images like painters: an analytical theory of outline first, details later

    Binxu Wang and John J Vastola. Diffusion models generate images like painters: an analytical theory of outline first, details later. arXiv preprint arXiv:2303.02490, 2023

  23. [23]

    Dynamical regimes of diffusion models

    Giulio Biroli, Tony Bonnaire, Valentin De Bortoli, and Marc Mézard. Dynamical regimes of diffusion models. Nature Communications, 15(1):9957, 2024

  24. [24]

    Learning mixtures of gaussians using the ddpm objective

    Kulin Shah, Sitan Chen, and Adam Klivans. Learning mixtures of gaussians using the ddpm objective. Advances in Neural Information Processing Systems, 36:19636–19649, 2023

  25. [25]

    An analytic theory of creativity in convolutional diffusion models

    Mason Kamb and Surya Ganguli. An analytic theory of creativity in convolutional diffusion models. arXiv preprint arXiv:2412.20292, 2024

  26. [26]

    Elucidating the Design Space of Diffusion-Based Generative Models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022

  27. [27]

    A connection between score matching and denoising autoencoders

    Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011

  28. [28]

    Flow Matching Guide and Code

    Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024

  29. [29]

    Training with noise is equivalent to tikhonov regularization

    Chris M Bishop. Training with noise is equivalent to tikhonov regularization. Neural computa- tion, 7(1):108–116, 1995

  30. [30]

    Ordinary differential equations

    Philip Hartman. Ordinary differential equations. SIAM, 2002

  31. [31]

    Diffusion models for gaussian distributions: Exact solutions and wasserstein errors

    Emile Pierret and Bruno Galerne. Diffusion models for gaussian distributions: Exact solutions and wasserstein errors. arXiv preprint arXiv:2405.14250, 2024

  32. [32]

    Effect of batch learning in multilayer neural networks

    Kenji Fukumizu. Effect of batch learning in multilayer neural networks. Gen, 1(04):1E–03, 1998

  33. [33]

    Toeplitz and circulant matrices: A review

    Robert M Gray et al. Toeplitz and circulant matrices: A review. Foundations and Trends® in Communications and Information Theory, 2(3):155–239, 2006

  34. [34]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations , 2021. URL https://openreview. net/forum?id=PxTIG12RRHS

  35. [35]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015

  36. [36]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  37. [37]

    Implicit bias of gradient descent on linear convolutional networks

    Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. Advances in neural information processing systems, 31, 2018. 11

  38. [38]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  39. [39]

    Blink of an eye: a simple theory for feature localization in generative models

    Marvin Li, Aayush Karan, and Sitan Chen. Blink of an eye: a simple theory for feature localization in generative models. arXiv preprint arXiv:2502.00921, 2025

  40. [40]

    The low-rank simplicity bias in deep networks

    Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, and Phillip Isola. The low-rank simplicity bias in deep networks. arXiv preprint arXiv:2103.10427, 2021

  41. [41]

    Implicit rank-minimizing autoencoder

    Li Jing, Jure Zbontar, et al. Implicit rank-minimizing autoencoder. Advances in Neural Information Processing Systems, 33:14736–14746, 2020

  42. [42]

    Expandnets: Linear over- parameterization to train compact convolutional networks

    Shuxuan Guo, Jose M Alvarez, and Mathieu Salzmann. Expandnets: Linear over- parameterization to train compact convolutional networks. Advances in Neural Information Processing Systems, 33:1298–1310, 2020

  43. [43]

    Inductive bias of multi-channel linear convolutional networks with bounded weight norm

    Meena Jagadeesan, Ilya Razenshteyn, and Suriya Gunasekar. Inductive bias of multi-channel linear convolutional networks with bounded weight norm. In Conference on Learning Theory, pages 2276–2325. PMLR, 2022

  44. [44]

    Geometry of linear convolutional networks

    Kathlén Kohn, Thomas Merkh, Guido Montúfar, and Matthew Trager. Geometry of linear convolutional networks. SIAM Journal on Applied Algebra and Geometry, 6(3):368–406, 2022

  45. [45]

    Function space and critical points of linear convolutional networks.SIAM Journal on Applied Algebra and Geometry, 8(2):333–362, 2024

    Kathlén Kohn, Guido Montúfar, Vahid Shahverdi, and Matthew Trager. Function space and critical points of linear convolutional networks.SIAM Journal on Applied Algebra and Geometry, 8(2):333–362, 2024

  46. [46]

    Deep image prior

    Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 9446–9454, 2018

  47. [47]

    The spectral bias of the deep image prior

    Prithvijit Chakrabarty and Subhransu Maji. The spectral bias of the deep image prior. arXiv preprint arXiv:1912.08905, 2019

  48. [48]

    A bayesian perspective on the deep image prior

    Zezhou Cheng, Matheus Gadelha, Subhransu Maji, and Daniel Sheldon. A bayesian perspective on the deep image prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5443–5451, 2019

  49. [49]

    The convergence rate of neural networks for learned functions of different frequencies

    Basri Ronen, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. Advances in Neural Information Processing Systems, 32, 2019

  50. [50]

    Frequency bias in neural networks for input of non-uniform density

    Ronen Basri, Meirav Galun, Amnon Geifman, David Jacobs, Yoni Kasten, and Shira Kritchman. Frequency bias in neural networks for input of non-uniform density. In International conference on machine learning, pages 685–694. PMLR, 2020

  51. [51]

    Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks

    Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International conference on artificial intelligence and statistics, pages 4313–4324. PMLR, 2020

  52. [52]

    Denoising and regularization via exploiting the structural bias of convolutional generators

    Reinhard Heckel and Mahdi Soltanolkotabi. Denoising and regularization via exploiting the structural bias of convolutional generators. arXiv preprint arXiv:1910.14634, 2019

  53. [53]

    Rank-one modification of the symmetric eigenproblem

    James R Bunch, Christopher P Nielsen, and Danny C Sorensen. Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31(1):31–48, 1978

  54. [54]

    A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem

    Ming Gu and Stanley C Eisenstat. A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM journal on Matrix Analysis and Applications , 15(4): 1266–1276, 1994

  55. [55]

    Some modified matrix eigenvalue problems

    Gene H Golub. Some modified matrix eigenvalue problems. SIAM review, 15(2):318–334, 1973

  56. [56]

    A proposal for Toeplitz matrix calculations

    Gilbert Strang. A proposal for Toeplitz matrix calculations. Studies in Applied Mathematics, 74 (2):171–176, 1986. 12

  57. [57]

    On the solution of circulant linear system

    Mingkui Chen. On the solution of circulant linear system. Technical Report TR-401, Yale University, Department of Computer Science, 1985. URL https://cpsc.yale.edu/sites/ default/files/files/tr401.pdf

  58. [58]

    Presentation slides (mastronardi.pdf)

    Michela Mastronardi. Presentation slides (mastronardi.pdf). https://www.math.unipd.it/ ~michela/2gg07/TALKS/Mastronardi.pdf, 2007. Accessed: 2025-05-23

  59. [59]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d 'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems , volume 32. Cur- ran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ 3001ef257407d5...

  60. [60]

    decrease

    further analyzed the inductive bias of the linear convolutional network with non-trivial local ker- nel size (neither pointwise nor full image) and multiple channels, and provided analytical statements about the inductive bias. However, they also found less success for closed form solutions for even two-layer convolutional networks with finite kernel widt...