Memorisation, convergence and generalisation in generative models
Pith reviewed 2026-05-21 03:15 UTC · model grok-4.3
The pith
Generative models can converge to the data distribution without recovering its principal latent factors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In linear generative models, convergence to the data distribution occurs continuously once the number of samples scales linearly with input dimension, while recovery of the principal latent factors occurs in a sharp transition at larger load. Convergence is insensitive to whether those factors have been recovered. The same distinction between convergence and latent recovery holds when the data has a power-law spectrum, both in experiments with convolutional denoisers and in the diffusion-model measurements of Kadkhodaie et al. Generalization therefore decomposes into matching the bulk distribution and recovering the principal factors, with only the former captured by convergence.
What carries the argument
Analytical separation of the continuous convergence transition from the sharp principal-latent-factor recovery transition in linear generative models.
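To make the sharp transition concrete, here is a sketch based on the textbook rank-one spiked covariance model. The model, the symbols α, β, u and the normalisation are assumptions for illustration and may differ from the paper's exact setup. In that model the recovery threshold is the classical BBP-type transition (cf. refs. [43, 45]):

```latex
% Assumed setup: x_i ~ N(0, I_d + \beta u u^\top), i = 1, ..., n, with |u| = 1,
% load \alpha = n/d held fixed as d \to \infty,
% and \hat{u} the top eigenvector of the sample covariance.
\[
  \alpha_c = \frac{1}{\beta^{2}},
  \qquad
  \lim_{d\to\infty} \langle \hat{u}, u \rangle^{2} =
  \begin{cases}
    0, & \alpha \le \alpha_c,\\[4pt]
    \dfrac{1 - \dfrac{1}{\alpha\beta^{2}}}{1 + \dfrac{1}{\alpha\beta}}, & \alpha > \alpha_c .
  \end{cases}
\]
```

Under these assumptions, a weak factor (β < 1) has a threshold α_c above 1, which leaves a window of loads where the sample count is already proportional to the dimension yet the factor is not recovered.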
If this is right
- Convergence requires a number of samples linear in the input dimension.
- Principal latent factor recovery requires substantially more samples and occurs sharply.
- Convergence and latent recovery can be decoupled, so observed convergence does not guarantee recovery of principal factors.
- The same separation between bulk matching and factor recovery persists for power-law spectra in both linear analysis and convolutional experiments.
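The power-law case can be probed numerically in the same spirit. The sketch below is illustrative only: the spectrum λ_k = k^(-exponent), the axis-aligned Gaussian data and all helper names are assumptions, not the paper's experimental setup. It measures, direction by direction, how well the empirical covariance recovers each true principal axis.

```python
import numpy as np


def power_law_spectrum(d, exponent):
    """Eigenvalues lambda_k = k**(-exponent) for k = 1..d (assumed form)."""
    return np.arange(1, d + 1, dtype=float) ** (-exponent)


def sample_gaussian(n, spectrum, rng):
    """Draw n samples from N(0, diag(spectrum)); the true principal axes are e_1, e_2, ..."""
    return rng.standard_normal((n, spectrum.size)) * np.sqrt(spectrum)


def direction_overlaps(x, k):
    """Squared overlap between the i-th empirical and i-th true principal axis, for i < k."""
    cov = x.T @ x / x.shape[0]
    _, vecs = np.linalg.eigh(cov)      # eigenvalues come back in ascending order
    top = vecs[:, ::-1][:, :k]         # k leading eigenvectors, largest first
    return np.diag(top[:k, :]) ** 2    # entry i is <v_i, e_i>^2


rng = np.random.default_rng(0)
spectrum = power_law_spectrum(d=100, exponent=1.5)
x = sample_gaussian(n=2000, spectrum=spectrum, rng=rng)
overlaps = direction_overlaps(x, k=30)
print(overlaps[[0, 9, 29]])
```

With these assumed settings the leading axes should be recovered far better than the trailing ones, because neighbouring eigenvalues deep in the power-law tail are nearly degenerate.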
Where Pith is reading between the lines
- Convergence alone may not be sufficient to confirm that a generative model has captured the most informative directions in the data.
- Separate metrics could be developed to track bulk-distribution matching versus principal-factor recovery.
- The distinction may help explain why some generative models produce realistic samples while still missing key structural variations.
- The separation could be tested in other architectures such as GANs or variational autoencoders to check whether it is architecture-independent.
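One way such separate metrics might look, as a minimal sketch: `bulk_distance` and `factor_recovery` are hypothetical names, and both definitions (a normalised Frobenius distance between covariances, and the mean squared cosine of the principal angles between top-k eigenspaces) are illustrative choices rather than the two distances used in the paper.

```python
import numpy as np


def top_eigvecs(cov, k):
    """Return the k eigenvectors of cov with the largest eigenvalues, as columns."""
    _, vecs = np.linalg.eigh(cov)      # eigenvalues come back in ascending order
    return vecs[:, -k:]


def bulk_distance(cov_a, cov_b):
    """Frobenius distance between two covariances, divided by sqrt(dimension)."""
    return float(np.linalg.norm(cov_a - cov_b) / np.sqrt(cov_a.shape[0]))


def factor_recovery(cov_learnt, cov_true, k):
    """Mean squared cosine of the principal angles between top-k eigenspaces (1 = recovered)."""
    u = top_eigvecs(cov_true, k)
    v = top_eigvecs(cov_learnt, k)
    return float(np.linalg.norm(u.T @ v) ** 2 / k)


# Toy check: a "learnt" covariance that puts the spike along the wrong axis.
d = 50
e0 = np.eye(d)[:, 0]
e1 = np.eye(d)[:, 1]
cov_true = np.eye(d) + 0.5 * np.outer(e0, e0)
cov_wrong = np.eye(d) + 0.5 * np.outer(e1, e1)
print(bulk_distance(cov_wrong, cov_true))
print(factor_recovery(cov_wrong, cov_true, 1))
```

In this toy case the two covariances are close in the bulk metric while the factor-recovery score is essentially zero, which is the kind of decoupling the summary describes.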
Load-bearing premise
The analytical results are exact only for linear generative models; the extension to nonlinear convolutional networks is shown only empirically.
What would settle it
A direct calculation or simulation on linear models in which convergence and principal latent factor recovery occur at the same data load would falsify the claimed separation of scales.
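A simulation of this kind could be sketched as follows. Everything here is an assumption for illustration: the rank-one spiked Gaussian data, the empirical covariance standing in for the "trained" linear model, the Frobenius disagreement between two models fitted on disjoint samples as a proxy for convergence, and all function names. It is not the paper's code and does not use its exact metrics.

```python
import numpy as np


def sample_spiked(n, u, beta, rng):
    """Draw n samples from N(0, I + beta * u u^T) for a unit vector u."""
    noise = rng.standard_normal((n, u.size))
    latent = rng.standard_normal((n, 1))
    return noise + np.sqrt(beta) * latent * u


def fit_covariance(x):
    """'Train' the linear model: its only parameter is the empirical covariance."""
    return x.T @ x / x.shape[0]


def run(alpha, d=200, beta=0.5, seed=0):
    """Return (disagreement between two independent fits, squared overlap with u)."""
    rng = np.random.default_rng(seed)
    u = np.zeros(d)
    u[0] = 1.0
    n = int(alpha * d)
    # Two independent draws play the role of disjoint training subsets.
    cov_a = fit_covariance(sample_spiked(n, u, beta, rng))
    cov_b = fit_covariance(sample_spiked(n, u, beta, rng))
    disagreement = np.linalg.norm(cov_a - cov_b) / np.sqrt(d)
    _, vecs = np.linalg.eigh(cov_a)    # last column is the top eigenvector
    overlap = (vecs[:, -1] @ u) ** 2
    return float(disagreement), float(overlap)


for alpha in (1, 4, 16, 64):
    print(alpha, run(alpha))
```

In this textbook model the recovery threshold is expected near α = 1/β², that is 4 for the assumed β = 0.5, while the disagreement should shrink smoothly, roughly as sqrt(2/α). Finding both quantities switching at the same load would count against the separation.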
Original abstract
Generative neural networks learn how to produce highly realistic images from a large, but finite number of examples - or do they simply memorise their training set? To settle this question, Kadkhodaie, Guth, Simoncelli and Mallat (ICLR '24) trained diffusion models independently on disjoint subsets of a dataset and showed that they converge to nearly the same density when the number of training images is large enough. This result raises two basic questions: how much data do you need for convergence, and what does convergence capture about learning the data distribution? Here, we address these questions by providing an exact analytical characterisation of the transition from memorisation to generalisation in linear generative models. We find that these models memorise at small load, while convergence emerges continuously when the number of samples is linear in the input dimension. Strikingly, we find that convergence is insensitive to recovery of the principal latent factors of the data, which are recovered in a sharp transition. After extending our approach to data with power-law spectra, we find the same distinction between convergence and latent recovery in our experiments with convolutional denoisers and in the data of Kadkhodaie et al. We thus show that generalisation in generative models decomposes into at least two distinct objectives: matching the bulk of the data distribution and recovering the principal latent factors. These objectives correspond to two different distances between true and learnt data distribution, and only the first one is captured by convergence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper provides an exact analytical characterization of the memorization-to-generalization transition in linear generative models, finding that convergence to the data distribution emerges continuously when the number of samples scales linearly with input dimension, while recovery of principal latent factors occurs via a sharp transition. It extends the framework to data with power-law spectra and reports that the same separation between convergence and latent recovery appears in experiments with convolutional denoisers as well as in re-analysis of Kadkhodaie et al. (ICLR 2024) data, concluding that generalization decomposes into at least two distinct objectives: matching the bulk of the distribution and recovering principal factors, which correspond to different distances between true and learned distributions.
Significance. If the analytical derivation holds, the work offers a precise decomposition of generalization in generative models into bulk matching (captured by convergence) versus principal-factor recovery, clarifying why models can converge without fully recovering latent structure. The exact analytical result for linear models and the empirical consistency across power-law spectra, CNN denoisers, and prior data constitute a clear strength, providing falsifiable predictions and a parameter-free separation of scales that can guide future theoretical and experimental work on diffusion and generative models.
major comments (1)
- [Section on extension to power-law spectra and convolutional experiments] The central analytical result is derived only for linear generative models. While the manuscript correctly notes that the extension to convolutional denoisers is empirical, the claim that 'the same distinction between convergence and latent recovery' holds in the nonlinear case (as stated in the abstract and the power-law section) would be strengthened by an intermediate argument showing that the two distances remain decoupled under the specific inductive biases of CNN denoisers; without it, the empirical match risks being coincidental rather than evidence of a general separation of objectives.
minor comments (2)
- [Abstract] The term 'load' is introduced in the abstract without an immediate definition; a brief parenthetical or forward reference to its definition in the linear-model section would improve readability for a broad audience.
- [Linear generative models section] Notation for the two distances between true and learned distributions is introduced late; defining them explicitly alongside the linear-model derivation (e.g., near the statement of the exact transition) would make the decomposition claim easier to track.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and for the constructive major comment. We address it point by point below, with a view to a minor revision.
Point-by-point responses
Referee: [Section on extension to power-law spectra and convolutional experiments] The central analytical result is derived only for linear generative models. While the manuscript correctly notes that the extension to convolutional denoisers is empirical, the claim that 'the same distinction between convergence and latent recovery' holds in the nonlinear case (as stated in the abstract and the power-law section) would be strengthened by an intermediate argument showing that the two distances remain decoupled under the specific inductive biases of CNN denoisers; without it, the empirical match risks being coincidental rather than evidence of a general separation of objectives.
Authors: We agree that a rigorous analytical argument establishing decoupling under CNN inductive biases would constitute stronger evidence. Our manuscript already qualifies the convolutional and Kadkhodaie et al. results as empirical extensions of the exact linear analysis, and the power-law section is likewise presented as an analytical generalization within the linear setting. The observed consistency of the convergence-versus-latent-recovery distinction across four distinct regimes (exact linear theory, power-law spectra, CNN denoisers, and re-analysis of prior diffusion-model data) supplies multiple independent lines of support that the separation is unlikely to be coincidental. Nevertheless, we acknowledge the referee’s point. In revision we will (i) add an explicit caveat in the abstract and power-law section reiterating that the nonlinear evidence is empirical, (ii) include a short discussion of why constructing an intermediate argument for CNNs remains an open theoretical challenge, and (iii) note that the current multi-regime empirical agreement provides falsifiable predictions for future work. This constitutes a partial revision: we retain the existing empirical claims while strengthening the surrounding qualifications.
Revision: partial
Circularity Check
Analytical derivation for linear models is self-contained with no reduction to inputs
The paper provides an exact analytical characterisation of the memorisation-to-generalisation transition specifically for linear generative models, deriving that convergence emerges when samples are linear in input dimension while principal factors recover in a sharp transition. This separation is then checked empirically on power-law spectra, convolutional denoisers, and re-analysis of prior data. No quoted equations or steps show a prediction reducing to a fitted parameter by construction, nor does any load-bearing claim rest on a self-citation chain that itself lacks independent verification. The derivation is therefore self-contained and does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Linear generative models admit an exact analytical solution for the memorization-to-convergence transition
Reference graph
Works this paper leans on
- [1] Kadkhodaie, Z., Guth, F., Simoncelli, E. P. & Mallat, S. Generalization in diffusion models arises from geometry-adaptive harmonic representations in The Twelfth International Conference on Learning Representations (2024) (cit. on pp. 1–3, 9, 10, 17, 27–29)
- [2] Neyshabur, B., Tomioka, R. & Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614 (2014) (cit. on p. 1)
- [3] Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning requires rethinking generalization in International Conference on Learning Representations (2017) (cit. on p. 1)
- [4] Neyshabur, B., Bhojanapalli, S., McAllester, D. & Srebro, N. Exploring generalization in deep learning. Advances in neural information processing systems 30 (2017) (cit. on p. 1)
- [5] Hastie, T., Tibshirani, R. & Friedman, J. An introduction to statistical learning (2009) (cit. on pp. 1, 10)
- [6] Muthukumar, V., Vodrahalli, K., Subramanian, V. & Sahai, A. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory 1, 67–83 (2020) (cit. on p. 1)
- [7] Bartlett, P. L., Long, P. M., Lugosi, G. & Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences 117, 30063–30070 (2020) (cit. on p. 1)
- [8] Tsigler, A. & Bartlett, P. L. Benign overfitting in ridge regression. Journal of Machine Learning Research 24, 1–76 (2023) (cit. on p. 1)
- [9] Belkin, M., Hsu, D. J. & Mitra, P. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. Advances in neural information processing systems 31 (2018) (cit. on p. 1)
- [10] Belkin, M., Hsu, D., Ma, S. & Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116, 15849–15854 (2019) (cit. on p. 1)
- [11] d’Ascoli, S., Refinetti, M., Biroli, G. & Krzakala, F. Double trouble in double descent: Bias and variance (s) in the lazy regime in International Conference on Machine Learning (2020), 2280–2290 (cit. on p. 1)
- [12] Mei, S. & Montanari, A. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics 75, 667–766 (2022) (cit. on p. 1)
- [13] Hastie, T., Montanari, A., Rosset, S. & Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. Annals of statistics 50, 949 (2022) (cit. on p. 1)
- [14] Gunasekar, S., Lee, J., Soudry, D. & Srebro, N. Characterizing implicit bias in terms of optimization geometry in International Conference on Machine Learning (2018), 1832–1841 (cit. on p. 1)
- [15] Chizat, L. & Bach, F. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss in Conference on learning theory (2020), 1305–1338 (cit. on p. 1)
- [16] Carlini, N. et al. Extracting training data from large language models in 30th USENIX security symposium (USENIX Security 21) (2021), 2633–2650 (cit. on pp. 1, 2)
- [17] Somepalli, G., Singla, V., Goldblum, M., Geiping, J. & Goldstein, T. Understanding and mitigating copying in diffusion models. Advances in Neural Information Processing Systems 36, 47783–47803 (2023) (cit. on pp. 1, 2)
- [18] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics in International conference on machine learning (2015), 2256–2265 (cit. on pp. 2, 9)
- [19] Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020) (cit. on pp. 2, 9, 28)
- [20] Garnier-Brun, J., Biggio, L., Beltrame, D., Mézard, M. & Saglietti, L. Biased Generalization in Diffusion Models. arXiv:2603.03469 (2026) (cit. on pp. 2, 10)
- [21] Mendes, V. C. et al. A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization. arXiv:2602.10680 (2026) (cit. on pp. 2, 10)
- [22]
- [23] Kalaj, S. et al. Random features Hopfield networks generalize retrieval to previously unseen examples. Physica A: Statistical Mechanics and its Applications, 130946 (2025) (cit. on p. 3)
- [24] Catania, G., Decelle, A., Furtlehner, C. & Seoane, B. A theoretical framework for overfitting in energy-based modeling in ICML 2025-Forty-Second International Conference on Machine Learning (2025) (cit. on p. 3)
- [25] Favero, A., Sclocchi, A. & Wyart, M. Bigger Isn’t Always Memorizing: Early Stopping Overparameterized Diffusion Models. arXiv preprint arXiv:2505.16959 (2025) (cit. on p. 3)
- [26] Bonnaire, T., Urfin, R., Biroli, G. & Mezard, M. Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training in The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025) (cit. on pp. 3, 11)
- [27] George, A. J., Veiga, R. & Macris, N. Denoising Score Matching with Random Features: Insights on Diffusion Models from Precise Learning Curves in The 29th International Conference on Artificial Intelligence and Statistics (2026) (cit. on pp. 3, 11)
- [28] Wang, B., Zavatone-Veth, J. & Pehlevan, C. A Random Matrix Theory Perspective on the Consistency of Diffusion Models. arXiv:2602.02908 (2026) (cit. on pp. 3, 4)
- [29] Udell, M. & Townsend, A. Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science 1, 144–160 (2019) (cit. on p. 3)
- [30] Hyvärinen, A., Hurri, J. & Hoyer, P. O. Natural image statistics: A probabilistic approach to early computational vision. (Springer Science & Business Media, 2009) (cit. on pp. 3, 7)
- [31] Holland, P. W., Laskey, K. B. & Leinhardt, S. Stochastic blockmodels: First steps. Social networks 5, 109–137 (1983) (cit. on pp. 3, 10)
- [32] Karrer, B. & Newman, M. E. Stochastic blockmodels and community structure in networks. Physical Review E 83, 016107 (2011) (cit. on pp. 3, 10)
- [33] Lesieur, T., Krzakala, F. & Zdeborová, L. Constrained low-rank matrix estimation: Phase transitions, approximate message passing and applications. Journal of Statistical Mechanics: Theory and Experiment 2017, 073403 (2017) (cit. on pp. 3, 10)
- [34] Abbe, E. Community detection and stochastic block models: recent developments. Journal of Machine Learning Research 18, 1–86 (2018) (cit. on pp. 3, 10)
- [35] Johnstone, I. M. On the distribution of the largest eigenvalue in principal components analysis. The Annals of statistics 29, 295–327 (2001) (cit. on p. 3)
- [36] Potters, M. & Bouchaud, J.-P. A first course in random matrix theory: for physicists, engineers and data scientists (Cambridge University Press, 2020) (cit. on pp. 3, 19, 21)
- [37] Baldi, P. & Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks 2, 53–58 (1989) (cit. on p. 3)
- [38] Saxe, A. M., McClelland, J. L. & Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks in International Conference on Learning Representations (2014) (cit. on p. 3)
- [39] Saxe, A. M., McClelland, J. L. & Ganguli, S. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences 116, 11537–11546 (2019) (cit. on p. 3)
- [40] Nam, Y. et al. Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking) in Proceedings of the 42nd International Conference on Machine Learning (eds Singh, A. et al.) 267 (2025), 81897–81929 (cit. on p. 3)
- [41] Merger, C. & Goldt, S. Generalization dynamics of linear diffusion models. arXiv preprint arXiv:2505.24769 (2025) (cit. on p. 3)
- [42] Wang, B. & Pehlevan, C. An analytical theory of spectral bias in the learning dynamics of diffusion models. Advances in Neural Information Processing Systems 38, 95865–95963 (2026) (cit. on p. 3)
- [43] Baik, J., Ben Arous, G. & Péché, S. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of probability 33, 1643–1697 (2005) (cit. on pp. 5, 10, 20)
- [44] Edwards, S. F. & Jones, R. C. The eigenvalue spectrum of a large symmetric random matrix. Journal of Physics A: Mathematical and General 9, 1595–1603 (1976) (cit. on p. 5)
- [45] Benaych-Georges, F. & Nadakuditi, R. R. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics 227, 494–521 (2011) (cit. on pp. 5, 20)
- [46] Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R. & Rohde, G. in Advances in Neural Information Processing Systems 32, 261–272 (2019) (cit. on pp. 6, 23)
- [47] Mohan, S., Kadkhodaie, Z., Simoncelli, E. P. & Fernandez-Granda, C. Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks in International Conference on Learning Representations (2020) (cit. on p. 9)
- [48] Kadkhodaie, Z. & Simoncelli, E. P. Solving linear inverse problems using the prior implicit in a denoiser. arXiv:2007.13640 (2020) (cit. on pp. 9, 28)
- [49] Montanari, A. & Richard, E. A statistical model for tensor PCA. Advances in neural information processing systems 27 (2014) (cit. on p. 10)
- [50]
- [51] Ricci, F., Bardone, L. & Goldt, S. Feature learning from non-Gaussian inputs: the case of Independent Component Analysis in high dimensions in Forty-second International Conference on Machine Learning (2025) (cit. on p. 10)
- [52] Feld, S. L. Why your friends have more friends than you do. American journal of sociology 96, 1464–1477 (1991) (cit. on p. 10)
- [53] Vapnik, V. & Chervonenkis, A. Theory of Pattern Recognition: Statistical Learning Problems. In Russian (Nauka, 1974) (cit. on p. 10)
- [54] Kamb, M. & Ganguli, S. An analytic theory of creativity in convolutional diffusion models in Forty-second International Conference on Machine Learning (2025) (cit. on p. 10)
- [55] Van den Burg, G. & Williams, C. On memorization in probabilistic deep generative models. Advances in neural information processing systems 34, 27916–27928 (2021) (cit. on p. 10)
- [56] Liao, Z. & Couillet, R. On the spectrum of random features maps of high dimensional data in International Conference on Machine Learning (2018), 3063–3071 (cit. on p. 11)
- [57] Seddik, M., Tamaazousti, M. & Couillet, R. Kernel Random Matrices of Large Concentrated Data: the Example of GAN-Generated Images in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019), 7480–7484 (cit. on p. 11)
- [58] Mei, S. & Montanari, A. The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve. Communications on Pure and Applied Mathematics (2021) (cit. on p. 11)
- [59] Schölkopf, B., Smola, A. & Müller, K.-R. Kernel principal component analysis in International conference on artificial neural networks (1997), 583–588 (cit. on p. 11)
- [60] Goldt, S., Mézard, M., Krzakala, F. & Zdeborová, L. Modeling the influence of data structure on learning in neural networks: The hidden manifold model. Phys. Rev. X 10, 041044 (2020) (cit. on p. 11)
- [61] Hu, H. & Lu, Y. M. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory 69, 1932–1964 (2022) (cit. on p. 11)
- [62] Goldt, S. et al. The Gaussian equivalence of generative models for learning with two-layer neural networks in Mathematical and Scientific Machine Learning (2021) (cit. on p. 11)
- [63] Loureiro, B. et al. Learning curves of generic features maps for realistic datasets with a teacher-student model in Advances in Neural Information Processing Systems 34 (2021) (cit. on p. 11)
- [64] Refinetti, M. & Goldt, S. The dynamics of representation learning in shallow, non-linear autoencoders in International Conference on Machine Learning (2022), 18499–18519 (cit. on p. 11)
- [65] Cui, H. & Zdeborová, L. High-dimensional asymptotics of denoising autoencoders. Advances in Neural Information Processing Systems 36, 11850–11890 (2023) (cit. on p. 11)
- [66] Cui, H., Krzakala, F., Vanden-Eijnden, E. & Zdeborová, L. Analysis of learning a flow-based generative model from limited sample complexity in International Conference on Learning Representations 2024 (2024), 51929–51955 (cit. on p. 11)
- [67] Bardone, L., Merger, C. & Goldt, S. A theory of learning data statistics in diffusion models, from easy to hard in Forty-third International Conference on Machine Learning (2026) (cit. on p. 11)
- [68] Biroli, G., Bonnaire, T., De Bortoli, V. & Mézard, M. Dynamical regimes of diffusion models. Nature Communications 15, 9957 (2024) (cit. on p. 11)
- [69] George, A. J., Veiga, R. & Macris, N. Analysis of diffusion models for manifold data in 2025 IEEE International Symposium on Information Theory (ISIT) (2025), 1–6 (cit. on p. 11)
- [70] Achilli, B., Benedetti, M., Biroli, G. & Mézard, M. Theory of speciation transitions in diffusion models with general class structure. Journal of Statistical Mechanics: Theory and Experiment 2026, 043304 (2026) (cit. on p. 11)
- [71] Vershynin, R. High-dimensional probability: An introduction with applications in data science (Cambridge university press, 2018) (cit. on p. 17)
- [72] Marchenko, V. A. & Pastur, L. A. Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik 114, 507–536 (1967) (cit. on p. 19)
- [73] Bandeira, A. S., Singer, A. & Strohmer, T. Topics in Mathematics of Data Science (2025) (cit. on pp. 24, 25)
- [74] Liu, Z., Luo, P., Wang, X. & Tang, X. Deep Learning Face Attributes in the Wild in Proceedings of International Conference on Computer Vision (ICCV) (2015) (cit. on p. 25)