Diffusion Models Memorize in Training -- and Generalize in Inference
Pith reviewed 2026-05-21 10:48 UTC · model grok-4.3
The pith
Diffusion models overfit the denoising objective but generalize in inference because model error shifts sampling trajectories away from training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The flow field generalizes through model error, which moves sampling trajectories outside the domain of noisy training samples and thereby naturally prevents overfitting, even as the model fully memorizes the training data in the denoising objective.
What carries the argument
The denoising flow field, which localizes sharply around training points in its optimal form but is smoothed by model error into a generalizing version.
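To make the localization concrete, here is a minimal sketch of the optimal denoiser for an empirical training set. It is not the paper's toy model. It assumes the variance-exploding corruption z = x + sigma * eps, under which the optimal denoiser is a softmax-weighted average of the training points. The function name `optimal_denoiser`, the eight-point circular training set, and the chosen noise levels are all illustrative assumptions.

```python
import numpy as np

def optimal_denoiser(z, train, sigma):
    """Posterior mean E[x | z] when the data distribution is the empirical training set.

    Assumes the corruption z = x + sigma * eps with eps ~ N(0, I) (an assumption of this sketch).
    Returns the denoised point and the softmax weight placed on each training point.
    """
    sq_dist = np.sum((train - z) ** 2, axis=1)       # squared distance to every training point
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits = logits - logits.max()                    # stabilize the softmax
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ train, weights

# Hypothetical toy training set: eight points on the unit circle.
angles = 2.0 * np.pi * np.arange(8) / 8
train = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# A query state sitting very close to training point 3.
z = train[3] + np.array([0.02, -0.01])

for sigma in (0.05, 0.5, 5.0):
    denoised, weights = optimal_denoiser(z, train, sigma)
    print(f"sigma={sigma:4.2f}  max weight={weights.max():.3f}  denoised={np.round(denoised, 3)}")
```

At small sigma nearly all weight falls on the single closest training point, which is the sharp localization described above. At large sigma the weights flatten and the output drifts toward the mean of the training set.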
If this is right
- The generalization gap between training and validation performance is largest at intermediate noise levels.
- Model error suppresses exact recall of individual training points and produces a smooth flow field instead.
- The training generalization gap does not carry over to inference time.
- Generated samples show no strong similarity to training samples despite the objective-level overfitting.
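One way the per-noise-level gap in the first bullet could be measured is sketched below. The denoiser used here is the fully memorizing empirical optimum, standing in for a trained network, so the shape of its gap curve should not be read as the paper's result. The data, the noise grid, and the helper names `empirical_denoiser` and `denoising_loss` are assumptions made for illustration.

```python
import numpy as np

def empirical_denoiser(z, train, sigma):
    """Batch posterior mean under the empirical training distribution (stand-in model)."""
    sq_dist = ((z[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize the softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ train

def denoising_loss(clean, train, sigma, rng):
    """Mean squared denoising error on `clean` at noise level `sigma`."""
    noisy = clean + sigma * rng.normal(size=clean.shape)
    residual = empirical_denoiser(noisy, train, sigma) - clean
    return float(np.mean(np.sum(residual ** 2, axis=1)))

rng = np.random.default_rng(0)
train = rng.normal(size=(64, 4))    # hypothetical training split
val = rng.normal(size=(64, 4))      # hypothetical held-out split from the same distribution

sigmas = np.geomspace(0.01, 10.0, 7)
gaps = []
for sigma in sigmas:
    train_loss = denoising_loss(train, train, sigma, rng)
    val_loss = denoising_loss(val, train, sigma, rng)
    gaps.append(val_loss - train_loss)                    # generalization gap at this noise level
    print(f"sigma={sigma:7.3f}  train={train_loss:.3f}  val={val_loss:.3f}  gap={gaps[-1]:.3f}")
```

To apply the same measurement to a real model, `empirical_denoiser` would be replaced by the trained network's denoising output, with the loss evaluated on true training and validation splits.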
Where Pith is reading between the lines
- The same error-driven smoothing might stabilize other generative models whose training objectives differ from their inference paths.
- Controlled amounts of model error could be deliberately introduced during training to improve generalization without changing the architecture.
- Quantifying trajectory distances in large-scale diffusion models would test how far this separation holds in practice.
Load-bearing premise
The intermediate states of sampling trajectories are sufficiently far from the distribution of noisy training samples the model is trained on.
What would settle it
Measuring the distance of intermediate sampling states to the nearest noisy training samples and checking whether this distance correlates with increased similarity between generated and training samples.
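A minimal sketch of such a measurement follows, assuming intermediate sampler states and their final samples have been recorded at a known noise level. The arrays `states` and `samples` below are random placeholders, not real trajectories, so the printed correlation carries no meaning. The helper names, the additive corruption z = x + sigma * eps, and the number of noisy copies per training point are all assumptions.

```python
import numpy as np

def nn_distance(queries, reference):
    """Euclidean distance from each query to its nearest neighbour in `reference`."""
    sq = ((queries[:, None, :] - reference[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(sq.min(axis=1))

def distance_to_noisy_train(states, train, sigma, rng, copies=16):
    """Distance of each state to the nearest of `copies` noisy draws per training point."""
    noise = rng.normal(size=(copies,) + train.shape)
    noisy = (train[None, :, :] + sigma * noise).reshape(-1, train.shape[1])
    return nn_distance(states, noisy)

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 3))            # hypothetical training set
sigma = 0.5                                  # hypothetical intermediate noise level
states = 1.5 * rng.normal(size=(20, 3))      # placeholder for recorded sampler states at `sigma`
samples = 0.8 * states                       # placeholder for the corresponding final samples

state_dist = distance_to_noisy_train(states, train, sigma, rng)
sample_dist = nn_distance(samples, train)    # small values indicate near-copies of training data
corr = np.corrcoef(state_dist, sample_dist)[0, 1]
print(f"correlation between trajectory distance and sample novelty: {corr:.2f}")
```

With real trajectories, a clearly positive correlation would be consistent with the premise: trajectories that stay far from the noisy training samples would end in samples that are less similar to the training data.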
Original abstract
Diffusion models generalize well in practice. However, an optimal diffusion model fully memorizes the training data and therefore fails to generalize, raising the question of what induces generalization in a real diffusion model. We show that, despite generalizing at the sample level, diffusion models progressively overfit the denoising training objective and thereby create a generalization gap between the performance on validation and training samples. This gap is most pronounced at intermediate noise levels. Using a fully analytic error-prone toy model, we trace the factors affecting the generalization gap. We find that the optimal denoising flow field localizes sharply around training points, but the model error suppresses the exact recall of training points, yielding a smooth, generalizing flow field. Finally, we find that the generalization gap observed in training does not translate to inference, which would result in a strong similarity between generated samples and training samples. This is because the intermediate states of sampling trajectories are sufficiently far from the distribution of noisy training samples the model is trained on. Together, these findings reveal a novel picture of how diffusion models generalize: the flow field generalizes through model error, which moves sampling trajectories outside the domain of noisy training samples and thereby naturally prevents overfitting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that diffusion models progressively overfit the denoising training objective despite generalizing at the sample level, creating a generalization gap most pronounced at intermediate noise levels. Using a fully analytic error-prone toy model, it traces how the optimal denoising flow field localizes sharply around training points but model error suppresses exact recall to yield a smooth generalizing flow field. The authors conclude that the training generalization gap does not translate to inference because intermediate states of sampling trajectories remain sufficiently far from the distribution of noisy training samples, revealing that the flow field generalizes through model error which naturally prevents overfitting.
Significance. If the result holds, the work offers a mechanistic account of generalization in diffusion models that credits model error for smoothing the flow field and separating inference trajectories from memorized noisy data. The fully analytic toy model is a clear strength, providing reproducible derivations and a falsifiable picture that could inform training practices and architectural choices in the field.
major comments (2)
- [Toy Model Analysis] The construction of the analytic error-prone toy model, its specific error model, and the mapping from toy error structure to high-capacity neural denoisers are not detailed sufficiently. This leaves derivation gaps in verifying the localization of the optimal flow field and the independent smoothing effect of error, which are load-bearing for the central claim that model error induces generalization.
- [Inference and Generalization] The inference claim that sampling trajectories remain outside the support of noisy training samples rests on an untested distance assumption extrapolated from the toy model. Without quantitative checks in neural-network regimes, this premise is insufficient to establish that the observed training/validation gap at intermediate noise levels does not produce overfitting at inference time.
minor comments (2)
- The abstract would benefit from one or two quantitative statements (e.g., measured gap sizes or distance statistics from the toy model) to make the claims more concrete.
- [Toy Model Analysis] Notation for the error term and flow-field localization in the toy-model derivations could be introduced more explicitly to aid readers.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work's significance and for highlighting the value of the fully analytic toy model. We address each major comment below with point-by-point responses, indicating where revisions will be made to strengthen clarity and support for the claims.
- Referee: [Toy Model Analysis] The construction of the analytic error-prone toy model, its specific error model, and the mapping from toy error structure to high-capacity neural denoisers are not detailed sufficiently. This leaves derivation gaps in verifying the localization of the optimal flow field and the independent smoothing effect of error, which are load-bearing for the central claim that model error induces generalization.
Authors: We agree that the toy model section would benefit from greater explicitness to facilitate independent verification. In the revised manuscript we will expand the relevant section and add a dedicated appendix containing: (i) the complete step-by-step derivation of the optimal denoising flow field and its localization around training points, (ii) the precise mathematical definition of the error model (including how additive perturbations to the score are introduced and scaled with noise level), and (iii) an explicit discussion mapping the toy error structure onto high-capacity neural denoisers by linking finite optimization and capacity constraints to analogous smoothing behavior. These additions will close the noted derivation gaps while preserving the analytic character of the model.
Revision: yes
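The effect under discussion can be illustrated with a small sampling sketch. It is not the authors' toy model, and the error field below is invented for illustration: it adds a smooth perturbation of constant norm to the denoiser output, which corresponds to a score perturbation scaling as 1/sigma^2. The sketch integrates the flow dz/dsigma = (z - D(z, sigma)) / sigma with Euler steps, once with the exact empirical denoiser and once with the perturbed one. The schedule, the training set, and the names `model_error` and `sample` are all assumptions.

```python
import numpy as np

def exact_denoiser(z, train, sigma):
    """Posterior mean under the empirical training distribution (the memorizing optimum)."""
    logits = -np.sum((train - z) ** 2, axis=1) / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max())
    return (weights / weights.sum()) @ train

def model_error(z, error_scale):
    """Hypothetical smooth error field with constant norm `error_scale` (not from the paper)."""
    phase = 3.0 * (z[0] + z[1])
    return error_scale * np.array([np.cos(phase), np.sin(phase)])

def sample(train, sigmas, z_init, error_scale=0.0):
    """Euler integration of dz/dsigma = (z - D(z, sigma)) / sigma down to sigma = 0."""
    z = z_init.copy()
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = exact_denoiser(z, train, sigma) + model_error(z, error_scale)
        z = z + (sigma_next - sigma) * (z - denoised) / sigma
    return z

def nearest_train_distance(x, train):
    """Distance from `x` to the closest training point."""
    return float(np.sqrt(np.sum((train - x) ** 2, axis=1)).min())

angles = 2.0 * np.pi * np.arange(8) / 8
train = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # toy training set on the unit circle
sigmas = np.append(np.geomspace(5.0, 0.01, 30), 0.0)         # noise schedule ending at zero
z_init = sigmas[0] * np.random.default_rng(0).normal(size=2)

exact_gap = nearest_train_distance(sample(train, sigmas, z_init), train)
perturbed_gap = nearest_train_distance(sample(train, sigmas, z_init, error_scale=0.2), train)
print(f"exact flow:     distance to nearest training point = {exact_gap:.4f}")
print(f"perturbed flow: distance to nearest training point = {perturbed_gap:.4f}")
```

In this construction the exact flow ends essentially on a training point, while the perturbed flow ends a finite distance away. How the paper's own error model scales with noise level is exactly the detail the referee asks the authors to specify, so the choice made here should be treated as a placeholder.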
- Referee: [Inference and Generalization] The inference claim that sampling trajectories remain outside the support of noisy training samples rests on an untested distance assumption extrapolated from the toy model. Without quantitative checks in neural-network regimes, this premise is insufficient to establish that the observed training/validation gap at intermediate noise levels does not produce overfitting at inference time.
Authors: The distance claim follows directly from the analytic smoothing of the flow field derived in the toy model: once model error is present, the resulting vector field steers trajectories away from the localized support of noisy training points at intermediate noise levels. We acknowledge that the current manuscript does not supply quantitative distance measurements or trajectory analyses performed with actual neural-network denoisers. In revision we will add a discussion paragraph outlining how the assumption could be tested empirically (e.g., via latent-space distance statistics or controlled low-dimensional network experiments) and will note this as an important direction for follow-up work. We maintain, however, that the toy-model derivation already supplies a mechanistic, falsifiable account consistent with the observed training/validation gap; the absence of large-scale numerical checks does not invalidate the analytic insight but does limit the strength of the extrapolation.
Revision: partial
- Quantitative validation of the sampling-trajectory distance assumption in high-capacity neural-network regimes
Circularity Check
No significant circularity; toy-model derivations are independent of inference claim
The paper's core chain relies on a fully analytic error-prone toy model to derive localization of the optimal denoising flow field around training points and the smoothing effect of model error, both obtained independently via explicit equations rather than by redefining inputs or fitting. The subsequent claim that inference trajectories remain outside the support of noisy training samples follows directly from this error-induced smoothing in the toy setting, without reducing to a self-citation, a fitted parameter renamed as prediction, or an ansatz smuggled through prior work. No load-bearing step equates to its own inputs by construction, and the analysis remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: An analytic error-prone toy model captures the essential localization and smoothing behavior of the denoising flow field in real diffusion models.
- domain assumption: Intermediate states along sampling trajectories lie sufficiently far from the distribution of noisy training samples.
Supplementary Material
A Fréchet Distance (FD) with partial inference (early stopping)
In our experiments, we found a sig...
- We sampled 15 50k random subsets from the training split, with the same class prior as in the 50k validation split.
- We used that class prior to generate 50k images per run.
- We computed the FD of the generated set against all available subsets of the training and validation split and averaged the results across subsets.
On CIFAR-10/100, we follow the same protocol, but since the validation set is limited to 10k samples, we also limit the training subsets to 10k samples. We still use 50k generated images each time. The results...
... Train" shows the average results between 20 different subsets of the training data.
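The subset-averaged FD protocol described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' evaluation code: the feature arrays are random placeholders for image embeddings, the subsets are drawn uniformly without the class-prior matching mentioned in the protocol, and the names `frechet_distance` and `averaged_fd` are hypothetical. The FD itself is the standard Fréchet distance between Gaussians fitted to the two feature sets.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues of
    # cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

def averaged_fd(generated, pool, num_subsets, subset_size, rng):
    """Average FD of `generated` against random subsets of `pool` (no class stratification)."""
    scores = []
    for _ in range(num_subsets):
        idx = rng.choice(len(pool), size=subset_size, replace=False)
        scores.append(frechet_distance(generated, pool[idx]))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
generated = rng.normal(size=(500, 4))        # placeholder features of generated images
train_feats = rng.normal(size=(2000, 4))     # placeholder features of the training split
val_feats = rng.normal(size=(500, 4))        # placeholder features of the validation split

fd_train = averaged_fd(generated, train_feats, num_subsets=5, subset_size=500, rng=rng)
fd_val = frechet_distance(generated, val_feats)
print(f"FD vs train subsets: {fd_train:.4f}   FD vs validation: {fd_val:.4f}")
```

Matching the subset size to the validation set size, as the protocol does, keeps the two FD values comparable, because FD estimates are biased by sample size.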