Diffusion Models, Denoiser Architecture and Creativity
Pith reviewed 2026-05-20 20:37 UTC · model grok-4.3
The pith
Diffusion models generate creative images due to interactions between denoiser architecture and the target data distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The distribution of images produced by a diffusion model is not simply the target distribution but is shaped by an interaction between that distribution and the functional form imposed by the denoiser architecture. For linear, polynomial, and bottleneck architectures the paper supplies explicit mappings from the target distribution to the generated distribution. Empirical variation of the UNet shows that these mappings determine whether outputs are realistic or non-realistic, confirming that architectural inductive bias is the decisive factor.
What carries the argument
The inductive bias of the denoiser architecture, which fixes the exact functional mapping from noisy input to denoised output and thereby determines the effective distribution of generated samples.
Load-bearing premise
The derivations assume the reverse process uses precisely the functional form dictated by the architecture without any alteration from training dynamics or regularization.
What would settle it
Train a linear denoiser on data drawn from a known simple distribution such as a Gaussian mixture and verify whether the empirical distribution of generated samples matches the closed-form expression derived for that architecture.
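A minimal sketch of this check in one dimension, under stated assumptions. The review does not reproduce the paper's closed-form expression, so the sketch relies on a standard result as a stand-in: the least-squares affine denoiser is the Wiener filter, which depends only on the data mean and variance, so the samples it generates should be approximately Gaussian with those moments. The mixture parameters, the noise schedule, the Euler sampler, and all function names are illustrative choices, not taken from the paper.

```python
import random
import statistics

rng = random.Random(0)

# Toy target (assumed for illustration): two modes at -2 and +2, each with std 0.5.
train = [rng.choice((-2.0, 2.0)) + 0.5 * rng.gauss(0.0, 1.0) for _ in range(3000)]
mu = statistics.fmean(train)
var = statistics.pvariance(train, mu)

def linear_denoiser(y, sigma):
    """Least-squares affine denoiser for y = x + sigma * noise (Wiener form)."""
    gain = var / (var + sigma ** 2)
    return mu + gain * (y - mu)

# Geometric noise schedule from sigma_max down to sigma_min, then a final step to 0.
n_steps, sigma_max, sigma_min = 200, 80.0, 0.002
sigmas = [sigma_max * (sigma_min / sigma_max) ** (i / (n_steps - 1)) for i in range(n_steps)]
sigmas.append(0.0)

def sample(denoiser):
    """Euler integration of the probability-flow ODE dx/dsigma = (x - D(x, sigma)) / sigma."""
    x = sigma_max * rng.gauss(0.0, 1.0)  # start from pure noise
    for s, s_next in zip(sigmas, sigmas[1:]):
        x += (s_next - s) * (x - denoiser(x, s)) / s
    return x

def kurtosis(xs):
    """Fourth standardized moment; equals 3 for a Gaussian."""
    m = statistics.fmean(xs)
    v = statistics.pvariance(xs, m)
    return statistics.fmean([(x - m) ** 4 for x in xs]) / v ** 2

generated = [sample(linear_denoiser) for _ in range(3000)]
gen_var = statistics.pvariance(generated)
data_kurt = kurtosis(train)
gen_kurt = kurtosis(generated)

print("variance  target / generated:", var, gen_var)
print("kurtosis  target / generated:", data_kurt, gen_kurt)
```

If the stand-in result holds, the generated variance tracks the target variance while the kurtosis moves toward the Gaussian value of 3, so the two modes of the target are lost. Replacing the Gaussian prediction with the paper's own expression would turn this into the test described above.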
Original abstract
The creativity of diffusion models refers to their ability to generate highly realistic images that are different from their training data. Creativity is somewhat surprising since it is known that if the denoiser used in the diffusion model is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples. In this paper we present empirical and theoretical results that suggest that creativity in diffusion models is due to an interaction between the denoiser architecture and the target distribution. Theoretically, we give explicit forms for the distribution of generated samples as a function of the target distribution and the denoiser architecture for three different denoiser architectures (linear, polynomial, bottleneck). Empirically, we show that small changes in the popular UNET denoiser architecture lead to very different forms of creativity, and these small changes often yield samples that are highly nonrealistic. Taken together, our results show that diffusion models will only be successful if the inductive bias of the denoiser architecture is in strong alignment with the true target distribution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that creativity in diffusion models—generating realistic but novel samples unlike training data—arises from the interaction between denoiser architecture inductive bias and the target distribution, rather than from the Bayes-optimal denoiser which would copy samples. It derives explicit closed-form expressions for the distribution of generated samples under three denoiser architectures (linear, polynomial, and bottleneck) as functions of the target distribution. Empirically, small modifications to the standard UNet denoiser are shown to produce markedly different creativity behaviors, frequently yielding highly non-realistic outputs. The authors conclude that diffusion models succeed only when the denoiser's inductive bias is in strong alignment with the true target distribution.
Significance. If the explicit derivations hold and the empirical observations are robust to training details, the work would provide a valuable theoretical lens for understanding why diffusion models avoid pure memorization and how architecture choices control generative behavior. The attempt to obtain closed-form sample distributions for specific architectures is a positive step toward falsifiable analysis in this area.
major comments (2)
- [Theoretical section] Theoretical section (derivations for linear, polynomial, and bottleneck denoisers): the explicit forms treat the reverse process as directly applying the architecture's fixed functional mapping to noisy inputs. This setup is load-bearing for the central claim but omits the effects of training under the noise-prediction or score-matching loss; optimizer choice, regularization, and finite discretization can produce an effective mapping that deviates from the assumed closed-form, potentially allowing the learned denoiser to approximate distributions outside its nominal inductive bias.
- [Empirical results] Empirical evaluation of UNet modifications: the claim that small architecture changes yield very different creativity and often non-realistic samples requires quantitative metrics for creativity and realism, details on the precise modifications (e.g., which layers or connections altered), dataset specifications, and ablation controls to isolate inductive bias from other training factors.
minor comments (2)
- [Theoretical section] Clarify whether the derived distributions are parameter-free or depend on additional fitted quantities; this affects the strength of the 'explicit forms' contribution.
- [Introduction] Add a short discussion of related work on architecture biases in score-based or denoising models to better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments raise valid points about the assumptions underlying our theoretical derivations and the need for greater rigor and detail in the empirical section. We address each major comment below and will revise the manuscript to incorporate clarifications and additional information.
-
Referee: [Theoretical section] Theoretical section (derivations for linear, polynomial, and bottleneck denoisers): the explicit forms treat the reverse process as directly applying the architecture's fixed functional mapping to noisy inputs. This setup is load-bearing for the central claim but omits the effects of training under the noise-prediction or score-matching loss; optimizer choice, regularization, and finite discretization can produce an effective mapping that deviates from the assumed closed-form, potentially allowing the learned denoiser to approximate distributions outside its nominal inductive bias.
Authors: We appreciate this observation. Our closed-form expressions are derived under the assumption that the denoiser realizes the exact functional mapping permitted by its architecture (i.e., the architecture's inductive bias is fully expressed). This isolates the contribution of architecture to the generated distribution, which is the core of our theoretical claim. We agree that real training under score-matching or noise-prediction losses, together with finite sampling steps and imperfect optimization, can cause deviations from this ideal mapping. In the revision we will add an explicit discussion of these modeling assumptions, their relation to practical training, and the extent to which the architecture still constrains the reachable distributions even when the mapping is only approximate. revision: yes
-
Referee: [Empirical results] Empirical evaluation of UNet modifications: the claim that small architecture changes yield very different creativity and often non-realistic samples requires quantitative metrics for creativity and realism, details on the precise modifications (e.g., which layers or connections altered), dataset specifications, and ablation controls to isolate inductive bias from other training factors.
Authors: We agree that the empirical claims would be strengthened by additional quantitative support and experimental controls. In the revised manuscript we will: (i) report quantitative metrics for realism (e.g., FID) and creativity (e.g., nearest-neighbor distance to the training set); (ii) provide precise descriptions of the UNet modifications, including which layers or connections were altered; (iii) state the datasets and training protocols used; and (iv) include further ablation experiments that vary architecture while holding other training factors fixed. These additions will make the empirical evidence more reproducible and better isolate the role of inductive bias. revision: yes
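The nearest-neighbor distance mentioned in item (i) can be sketched as follows. This is only an illustration of the general idea: the rebuttal does not specify the distance, the feature space, or any threshold, so the Euclidean distance on raw vectors, the `threshold` value, and the function names are all assumptions.

```python
import math

def nearest_neighbor_distances(generated, training):
    """Euclidean distance from each generated sample to its closest training sample."""
    return [min(math.dist(g, t) for t in training) for g in generated]

def copy_fraction(distances, threshold):
    """Share of generated samples lying within `threshold` of some training sample."""
    return sum(d <= threshold for d in distances) / len(distances)

# Hypothetical toy data: three training points and three generated points in 2-D.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
generated = [(0.0, 0.0), (0.5, 0.5), (3.0, 4.0)]

distances = nearest_neighbor_distances(generated, training)
frac = copy_fraction(distances, threshold=0.1)  # threshold is an arbitrary choice
print(distances, frac)
```

A small distance flags a likely copy of a training sample, while a large distance alone does not separate a novel realistic sample from an unrealistic one, which is why the rebuttal pairs it with a realism metric such as FID.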
Circularity Check
Derivations are self-contained under explicit architecture assumptions; no reduction to inputs by construction
The paper derives explicit closed-form distributions of generated samples directly from the target distribution combined with the assumed exact functional mapping of each denoiser architecture (linear, polynomial, bottleneck). These derivations are presented as mathematical consequences of the architecture choice rather than as predictions fitted to data or defined in terms of the output itself. No self-citations are used to justify uniqueness or load-bearing premises, no parameters are fitted to subsets and then relabeled as predictions, and no ansatzes are smuggled via prior work. Empirical observations on UNET variants are separate from the theoretical chain and do not rely on the derivations for their validity. The analysis therefore remains independent of its conclusions under the stated modeling assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: If the denoiser is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples.
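The copying behavior in this axiom can be illustrated with a small sketch, assuming a uniform empirical distribution over a finite training set and additive Gaussian noise. Under those assumptions the Bayes-optimal denoiser is the posterior mean, a softmax-weighted average of the training points. The toy training set, the noise schedule, and the Euler sampler are illustrative choices and are not taken from the paper.

```python
import math
import random

rng = random.Random(0)
train = [-2.0, -0.5, 1.0, 3.0]  # hypothetical 1-D training set

def bayes_denoiser(y, sigma):
    """Posterior mean E[x | y] for x uniform over `train` and y = x + sigma * noise."""
    logits = [-(y - t) ** 2 / (2.0 * sigma ** 2) for t in train]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    return sum(w * t for w, t in zip(weights, train)) / sum(weights)

# Geometric noise schedule from sigma_max down to sigma_min, then a final step to 0.
n_steps, sigma_max, sigma_min = 100, 80.0, 0.002
sigmas = [sigma_max * (sigma_min / sigma_max) ** (i / (n_steps - 1)) for i in range(n_steps)]
sigmas.append(0.0)

def sample():
    """Euler integration of the probability-flow ODE dx/dsigma = (x - D(x, sigma)) / sigma."""
    x = sigma_max * rng.gauss(0.0, 1.0)  # start from pure noise
    for s, s_next in zip(sigmas, sigmas[1:]):
        x += (s_next - s) * (x - bayes_denoiser(x, s)) / s
    return x

generated = [sample() for _ in range(300)]

# Largest distance from any generated sample to its closest training point.
max_gap = max(min(abs(x - t) for t in train) for x in generated)
print("largest distance to a training point:", max_gap)
```

As the noise level shrinks, the softmax weights concentrate on the nearest training point, so each trajectory ends on a training sample. Any denoiser that generates novel samples must therefore deviate from this optimum.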