
arxiv: 2605.16415 · v1 · pith:THCZLJGYnew · submitted 2026-05-13 · 💻 cs.CV · cs.LG

Diffusion Models, Denoiser Architecture and Creativity

Pith reviewed 2026-05-20 20:37 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords diffusion models · denoiser architecture · creativity · inductive bias · UNet · generative models · image generation · target distribution

The pith

Diffusion models generate creative images due to interactions between denoiser architecture and the target data distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

If the denoiser were Bayes optimal for the training set, a diffusion model would simply reproduce existing samples. The paper shows instead that creativity arises because the architecture imposes its own inductive biases on the generation process, deviating from the target distribution in controlled ways. Explicit closed-form expressions are derived for the output distribution under linear, polynomial, and bottleneck denoisers. Experiments with small modifications to the UNet architecture produce markedly different and frequently unrealistic samples. The overall finding is that diffusion models work well only when the denoiser architecture's biases align closely with the true data distribution.

Core claim

The distribution of images produced by a diffusion model is not simply the target distribution but is shaped by an interaction between that distribution and the functional form imposed by the denoiser architecture. For linear, polynomial, and bottleneck architectures the paper supplies explicit mappings from the target distribution to the generated distribution. Empirical variation of the UNet shows that these mappings determine whether outputs are realistic or non-realistic, confirming that architectural inductive bias is the decisive factor.
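For orientation on the linear case, the following is a standard result from estimation theory, stated here as background rather than quoted from the paper. The symbols \mu, \Sigma, and \sigma are notation assumed for this note: the target has mean \mu and covariance \Sigma, and the noisy input is y = x + \sigma n with standard Gaussian noise n independent of x. Under those assumptions, the affine denoiser with the lowest mean squared error is:

```latex
% Assumed notation: E[x] = \mu, Cov(x) = \Sigma, y = x + \sigma n, n \sim \mathcal{N}(0, I) independent of x.
\hat{x}_{\mathrm{affine}}(y) = \mu + \Sigma \, (\Sigma + \sigma^{2} I)^{-1} (y - \mu)
```

This estimator depends on the target distribution only through its first two moments, so two datasets that share a mean and covariance yield the same affine denoiser at every noise level. That is one concrete way an architecture can reshape the generated distribution regardless of what else the data contains.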

What carries the argument

The inductive bias of the denoiser architecture, which fixes the exact functional mapping from noisy input to denoised output and thereby determines the effective distribution of generated samples.

Load-bearing premise

The derivations assume the reverse process uses precisely the functional form dictated by the architecture without any alteration from training dynamics or regularization.

What would settle it

Train a linear denoiser on data drawn from a known simple distribution such as a Gaussian mixture and verify whether the empirical distribution of generated samples matches the closed-form expression derived for that architecture.
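The check proposed above can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions that are not taken from the paper: a 2-D three-component mixture as the target, an affine denoiser fitted by least squares independently at each noise level, variance-exploding noise, and a deterministic DDIM-style update. All function names are hypothetical. The script prints the moments of the data and of the generated samples so they can be compared against whatever closed-form expression is under test.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n):
    """Three well-separated 2-D Gaussian clusters (hypothetical toy target)."""
    means = np.array([[-2.0, -1.0], [2.0, -1.0], [0.0, 2.0]])
    labels = rng.integers(0, 3, size=n)
    return means[labels] + 0.3 * rng.standard_normal((n, 2))

def fit_affine_denoiser(x, sigma):
    """Least-squares fit of x_hat = y @ W + b, where y = x + sigma * noise."""
    y = x + sigma * rng.standard_normal(x.shape)
    design = np.hstack([y, np.ones((len(y), 1))])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return coef[:-1], coef[-1]  # W has shape (d, d), b has shape (d,)

def generate(x_train, n_samples, sigmas):
    """Deterministic DDIM-style sampler (an assumed choice of reverse process)."""
    x = sigmas[0] * rng.standard_normal((n_samples, x_train.shape[1]))
    for sigma, sigma_next in zip(sigmas, np.append(sigmas[1:], 0.0)):
        W, b = fit_affine_denoiser(x_train, sigma)
        x_hat = x @ W + b
        # Move toward the denoised estimate, keeping a sigma_next-sized residual.
        x = x_hat + (sigma_next / sigma) * (x - x_hat)
    return x

data = sample_gmm(20_000)
sigmas = np.geomspace(50.0, 0.01, 200)  # decreasing noise schedule (assumed)
generated = generate(data, 5_000, sigmas)

print("data mean     ", data.mean(axis=0))
print("generated mean", generated.mean(axis=0))
print("data covariance\n", np.cov(data.T))
print("generated covariance\n", np.cov(generated.T))
```

Every sampler step here is an affine map, so the generated samples are Gaussian even though the data has three clusters. Comparing the printed moments, or a histogram of `generated`, with a derived expression is then a direct test of that expression under these particular sampler assumptions.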

Figures

Figures reproduced from arXiv: 2605.16415 by Itamar Levine, Yair Weiss.

Figure 1. The impact of minor U-Net architectural variations on diffusion model outputs. All models …
Figure 2. Two different data sets with the same mean and covariance (leftmost and rightmost) and data …
Figure 3. A dataset sampled from a mixture of three Gaussians, and the data sampled from …
Figure 4. A 3-component GMM dataset, and the data sampled from …
Figure 5. Samples from P_generated learned by a linear denoiser with a bottleneck of size 10 (top row), 100 (second row), and no bottleneck (bottom row). Each column was generated with the same starting noise. Even after the bias of a linear denoiser, the bottleneck size has a large influence on the sampled results.
Figure 6. First row: samples from P_generated learned by a linear denoiser over CelebA. Second row: samples from P_generated learned by a linear denoiser over a dataset sampled from P_generated learned by a linear denoiser over CelebA.
… data, y denotes the noisy input data, R is the fixed random matrix, and A is the learned weight matrix. Because we are minimizing the Mean Squared Error (MSE) with respect to the final matrix A, this…
Figure 7. Samples from P_generated learned by a polynomial denoiser over CelebA, using different degrees (from top: 3, 9, 27). The last row is the nearest neighbor of the degree-27 polynomial result. At high degree, the polynomial denoiser produces both creative images and memorized images from the training set.
Figure 8. Samples from P_generated learned by a patch polynomial denoiser over CelebA.
… Conversely, as we restricted the capacity in the networks with h = 500 and h = 100, the models avoided strict memorization and instead exhibited distinct forms of generative creativity. As illustrated in …
Figure 9. Samples from P_generated learned by a bottleneck architecture with h = 1000 (top), h = 500 (middle), and h = 100 (bottom). As expected, the top row is an exact sample from the training set.
Original abstract

The creativity of diffusion models refers to their ability to generate highly realistic images that are different from their training data. Creativity is somewhat surprising since it is known that if the denoiser used in the diffusion model is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples. In this paper we present empirical and theoretical results that suggest that creativity in diffusion models is due to an interaction between the denoiser architecture and the target distribution. Theoretically, we give explicit forms for the distribution of generated samples as a function of the target distribution and the denoiser architecture for three different denoiser architectures (linear, polynomial, bottleneck). Empirically, we show that small changes in the popular UNET denoiser architecture lead to very different forms of creativity, and these small changes often yield samples that are highly nonrealistic. Taken together, our results show that diffusion models will only be successful if the inductive bias of the denoiser architecture is in strong alignment with the true target distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that creativity in diffusion models—generating realistic but novel samples unlike training data—arises from the interaction between denoiser architecture inductive bias and the target distribution, rather than from the Bayes-optimal denoiser which would copy samples. It derives explicit closed-form expressions for the distribution of generated samples under three denoiser architectures (linear, polynomial, and bottleneck) as functions of the target distribution. Empirically, small modifications to the standard UNet denoiser are shown to produce markedly different creativity behaviors, frequently yielding highly non-realistic outputs. The authors conclude that diffusion models succeed only when the denoiser's inductive bias is in strong alignment with the true target distribution.

Significance. If the explicit derivations hold and the empirical observations are robust to training details, the work would provide a valuable theoretical lens for understanding why diffusion models avoid pure memorization and how architecture choices control generative behavior. The attempt to obtain closed-form sample distributions for specific architectures is a positive step toward falsifiable analysis in this area.

major comments (2)
  1. [Theoretical section] Theoretical section (derivations for linear, polynomial, and bottleneck denoisers): the explicit forms treat the reverse process as directly applying the architecture's fixed functional mapping to noisy inputs. This setup is load-bearing for the central claim but omits the effects of training under the noise-prediction or score-matching loss; optimizer choice, regularization, and finite discretization can produce an effective mapping that deviates from the assumed closed-form, potentially allowing the learned denoiser to approximate distributions outside its nominal inductive bias.
  2. [Empirical results] Empirical evaluation of UNet modifications: the claim that small architecture changes yield very different creativity and often non-realistic samples requires quantitative metrics for creativity and realism, details on the precise modifications (e.g., which layers or connections altered), dataset specifications, and ablation controls to isolate inductive bias from other training factors.
minor comments (2)
  1. [Theoretical section] Clarify whether the derived distributions are parameter-free or depend on additional fitted quantities; this affects the strength of the 'explicit forms' contribution.
  2. [Introduction] Add a short discussion of related work on architecture biases in score-based or denoising models to better situate the contribution.
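Major comment 1 notes that regularization can shift the learned mapping away from the one the architecture nominally permits. A minimal illustration, assuming weight decay as the regularizer and a purely linear denoiser fitted in closed form (neither choice is taken from the paper, and `fit_linear` is a hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical zero-mean data with per-axis variances 9 and 0.25.
x = rng.standard_normal((5_000, 2)) * np.array([3.0, 0.5])
sigma = 1.0
y = x + sigma * rng.standard_normal(x.shape)  # noisy inputs

def fit_linear(y, x, weight_decay):
    """Closed-form minimizer of ||y W - x||^2 / n + weight_decay * ||W||_F^2."""
    n, d = y.shape
    return np.linalg.solve(y.T @ y / n + weight_decay * np.eye(d), y.T @ x / n)

W_plain = fit_linear(y, x, weight_decay=0.0)
W_decay = fit_linear(y, x, weight_decay=1.0)

print("no weight decay:\n", np.round(W_plain, 3))
print("weight decay 1.0:\n", np.round(W_decay, 3))
```

Both fits use the same linear architecture and the same data, yet the diagonal gains differ, so any sample distribution derived from the unregularized mapping would not describe the regularized model exactly.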

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments raise valid points about the assumptions underlying our theoretical derivations and the need for greater rigor and detail in the empirical section. We address each major comment below and will revise the manuscript to incorporate clarifications and additional information.

Point-by-point responses
  1. Referee: [Theoretical section] Theoretical section (derivations for linear, polynomial, and bottleneck denoisers): the explicit forms treat the reverse process as directly applying the architecture's fixed functional mapping to noisy inputs. This setup is load-bearing for the central claim but omits the effects of training under the noise-prediction or score-matching loss; optimizer choice, regularization, and finite discretization can produce an effective mapping that deviates from the assumed closed-form, potentially allowing the learned denoiser to approximate distributions outside its nominal inductive bias.

    Authors: We appreciate this observation. Our closed-form expressions are derived under the assumption that the denoiser realizes the exact functional mapping permitted by its architecture (i.e., the architecture's inductive bias is fully expressed). This isolates the contribution of architecture to the generated distribution, which is the core of our theoretical claim. We agree that real training under score-matching or noise-prediction losses, together with finite steps and optimization, can cause deviations from this ideal mapping. In the revision we will add an explicit discussion of these modeling assumptions, their relation to practical training, and the extent to which the architecture still constrains the reachable distributions even when the mapping is only approximate. revision: yes

  2. Referee: [Empirical results] Empirical evaluation of UNet modifications: the claim that small architecture changes yield very different creativity and often non-realistic samples requires quantitative metrics for creativity and realism, details on the precise modifications (e.g., which layers or connections altered), dataset specifications, and ablation controls to isolate inductive bias from other training factors.

    Authors: We agree that the empirical claims would be strengthened by additional quantitative support and experimental controls. In the revised manuscript we will: (i) report quantitative metrics for realism (e.g., FID) and creativity (e.g., nearest-neighbor distance to the training set); (ii) provide precise descriptions of the UNet modifications, including which layers or connections were altered; (iii) state the datasets and training protocols used; and (iv) include further ablation experiments that vary architecture while holding other training factors fixed. These additions will make the empirical evidence more reproducible and better isolate the role of inductive bias. revision: yes
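The nearest-neighbor creativity metric mentioned in the response could be computed along the following lines. This is a sketch only: the function name, the Euclidean pixel-space distance, and the memorization threshold are assumptions, not details from the paper or the rebuttal.

```python
import numpy as np

def nearest_train_distance(generated, train):
    """Euclidean distance from each generated sample to its closest training sample."""
    g = generated.reshape(len(generated), -1).astype(np.float64)
    t = train.reshape(len(train), -1).astype(np.float64)
    # ||g - t||^2 = ||g||^2 - 2 g.t + ||t||^2, avoiding an (n, m, d) intermediate.
    sq = (g**2).sum(axis=1)[:, None] - 2.0 * g @ t.T + (t**2).sum(axis=1)[None, :]
    return np.sqrt(np.maximum(sq, 0.0)).min(axis=1)

# Tiny worked example: the first generated point copies a training point.
train = np.array([[0.0, 0.0], [3.0, 4.0]])
generated = np.array([[0.0, 0.0], [3.0, 0.0]])

dist = nearest_train_distance(generated, train)
memorized = dist < 1e-6  # hypothetical threshold for "copied from the training set"

print(dist)       # [0. 3.]
print(memorized)  # [ True False]
```

A distance near zero flags a memorized sample, while larger distances indicate novelty. Distance alone says nothing about realism, which is why the response pairs it with a separate metric such as FID.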

Circularity Check

0 steps flagged

Derivations are self-contained under explicit architecture assumptions; no reduction to inputs by construction

Full rationale

The paper derives explicit closed-form distributions of generated samples directly from the target distribution combined with the assumed exact functional mapping of each denoiser architecture (linear, polynomial, bottleneck). These derivations are presented as mathematical consequences of the architecture choice rather than as predictions fitted to data or defined in terms of the output itself. No self-citations are used to justify uniqueness or load-bearing premises, no parameters are fitted to subsets and then relabeled as predictions, and no ansatzes are smuggled via prior work. Empirical observations on UNET variants are separate from the theoretical chain and do not rely on the derivations for their validity. The analysis therefore remains independent of its conclusions under the stated modeling assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the background fact that a Bayes-optimal denoiser reproduces training samples exactly and on the modeling choice that each tested architecture (linear, polynomial, bottleneck) induces a distinct effective reverse process; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • Domain assumption: If the denoiser is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples.
    Invoked in the abstract as the known baseline that makes creativity surprising.
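This axiom can be checked numerically on a toy problem. For a finite training set and Gaussian noise, the Bayes optimal denoiser is the posterior mean, which is a softmax-weighted average of the training points (a standard identity). The 12-point training set, the noise schedule, and the deterministic DDIM-style update below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def bayes_denoiser(y, x_train, sigma):
    """Posterior mean E[x | y] for x uniform over x_train and y = x + sigma * noise."""
    sq_dist = ((y[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dist / (2.0 * sigma**2)
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x_train

# Hypothetical training set: 12 points evenly spaced on a circle of radius 2.
angles = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
x_train = 2.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)

sigmas = np.geomspace(10.0, 0.01, 100)  # decreasing noise schedule (assumed)
x = sigmas[0] * rng.standard_normal((200, 2))
for sigma, sigma_next in zip(sigmas, np.append(sigmas[1:], 0.0)):
    x_hat = bayes_denoiser(x, x_train, sigma)
    x = x_hat + (sigma_next / sigma) * (x - x_hat)

# Distance from each generated sample to its closest training point.
nearest = np.sqrt(((x[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=-1)).min(axis=1)
print("max distance to nearest training point:", nearest.max())
```

In this setup every generated sample lands on a training point to within numerical precision, which is the copying behavior the axiom describes. Any novelty in a practical model therefore has to come from the denoiser deviating from this optimum.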

pith-pipeline@v0.9.0 · 5705 in / 1339 out tokens · 70397 ms · 2026-05-20T20:37:50.397485+00:00 · methodology


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

  1. Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, 2014.
  2. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021.
  3. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer International Publishing, 2015.
  4. Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, pages 2256–2265, 2015.
  5. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020.
  6. Tero Karras, Miika Aittala, Samuli Laine, and Timo Aila. Elucidating the design space of diffusion-based generative models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022.
  7. Zahra Kadkhodaie, Florentin Guth, Eero P Simoncelli, and Stéphane Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representations. In The Twelfth International Conference on Learning Representations, 2024.
  8. Zahra Kadkhodaie, Stéphane Mallat, and Eero Simoncelli. Unconditional cnn denoisers contain sparse semantic representation of images. arXiv preprint arXiv:2506.01912, 2025.
  9. Mason Kamb and Surya Ganguli. An analytic theory of creativity in convolutional diffusion models. In Forty-second International Conference on Machine Learning, 2025.
  10. S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, 1997.
  11. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  12. Artem Lukoianov, Chenyang Yuan, Justin Solomon, and Vincent Sitzmann. Locality in image diffusion models emerges from data statistics. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.
  13. Matthew Niedoba, Berend Zwartsenberg, Kevin Patrick Murphy, and Frank Wood. Towards a mechanistic explanation of diffusion model generalization. In Forty-second International Conference on Machine Learning, 2025.
  14. Binxu Wang and John J. Vastola. The hidden linear structure in score-based models and its application. arXiv preprint arXiv:2311.10892, 2023.
  15. Chen Zeno, Hila Manor, Greg Ongie, Nir Weinberger, Tomer Michaeli, and Daniel Soudry. When diffusion models memorize: Inductive biases in probability flow of minimum-norm shallow neural nets. In Forty-second International Conference on Machine Learning, 2025.

A Theoretical proofs

A.1 Proof of Theorem 4

Let x ∈ R^d be a signal vector with covariance Σ_x = E[x x^T ...