arXiv: 2604.23552 · v1 · submitted 2026-04-26 · 💻 cs.LG · cs.AI · stat.ML

On the Memorization of Consistency Distillation for Diffusion Models

Pith reviewed 2026-05-08 06:31 UTC · model grok-4.3

classification: cs.LG · cs.AI · stat.ML
keywords: diffusion models · consistency distillation · memorization · generalization · generative modeling · training dynamics · sample quality · neural networks

The pith

Consistency distillation reduces memorization transferred from teacher to student diffusion models while preserving sample quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how an extra training phase of consistency distillation changes memorization behavior in diffusion models. When the teacher has already memorized training data, the resulting student shows substantially lower memorization of that data. Sample quality stays the same or improves. A theoretical analysis in a random feature neural network model explains the effect by showing that distillation removes unstable feature directions tied to memorization while keeping stable, generalizable ones. Readers care because distillation is widely used for faster generation, so its side effect on memorization matters for reliable and private deployment of these models.

Core claim

When a teacher diffusion model that has memorized data undergoes consistency distillation, the student model exhibits significantly reduced memorization while sample quality is preserved or improved. The process works because consistency distillation suppresses unstable feature directions associated with memorization and retains stable, generalizable modes, as demonstrated by both empirical results and analysis in a random feature neural network model.

What carries the argument

The consistency distillation process itself. In the random feature neural network model, it suppresses unstable feature directions associated with memorization while preserving stable, generalizable modes.
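
The page does not spell out the distillation objective, so the following is a minimal sketch of one term of the standard discrete consistency distillation loss, not the paper's implementation. It assumes a hypothetical 1-D Gaussian data distribution with standard deviation `sigma_data` (so the teacher denoiser has a closed form), noising of the form x_t = x_0 + t·ε, and a single Euler step of the probability-flow ODE. All function names are illustrative.

```python
import math

def teacher_denoiser(x, t, sigma_data=1.0):
    """Closed-form E[x0 | x_t] for x0 ~ N(0, sigma_data^2), x_t = x0 + t * eps."""
    return x * sigma_data**2 / (sigma_data**2 + t**2)

def euler_pf_ode_step(x, t_from, t_to, sigma_data=1.0):
    """One Euler step of the probability-flow ODE dx/dt = (x - D(x, t)) / t."""
    drift = (x - teacher_denoiser(x, t_from, sigma_data)) / t_from
    return x + (t_to - t_from) * drift

def cd_loss(student, target, x_next, t_next, t_cur, sigma_data=1.0):
    """Squared distance between the student at (x_next, t_next) and the
    target network at the teacher-propagated point (x_hat, t_cur).
    In practice the target is usually a slowly updated copy of the student."""
    x_hat = euler_pf_ode_step(x_next, t_next, t_cur, sigma_data)
    return (student(x_next, t_next) - target(x_hat, t_cur)) ** 2

def exact_consistency_fn(x, t, sigma_data=1.0, t_min=0.002):
    """Exact map along the ODE to time t_min for this Gaussian toy problem."""
    return x * math.sqrt((sigma_data**2 + t_min**2) / (sigma_data**2 + t**2))

def identity_fn(x, t):
    """A deliberately inconsistent 'student' that ignores the noise level."""
    return x

# Toy check: a self-consistent map incurs far less loss than the identity map.
loss_exact = cd_loss(exact_consistency_fn, exact_consistency_fn, 2.0, 1.0, 0.9)
loss_identity = cd_loss(identity_fn, identity_fn, 2.0, 1.0, 0.9)
print(loss_exact < loss_identity)
```

The loss is small only when the student assigns the same output to two points on the same teacher trajectory. This is the sense in which the student is trained on the teacher's flow rather than directly on the training data.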

If this is right

  • Distillation improves the memorization-generalization trade-off beyond its role in speeding up sampling.
  • Consistency distillation can serve as a mechanism to reduce transferred memorization from teacher to student.
  • Sample quality metrics remain stable or increase after the distillation step even as memorization decreases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The suppression of unstable features could be tuned by adjusting the number of distillation steps to target memorization more precisely.
  • Similar effects might appear in other acceleration techniques for diffusion models if they also emphasize stable directions.
  • Deployed systems could add a distillation stage specifically to address privacy risks from memorized training data.

Load-bearing premise

The random feature neural network model accurately represents the memorization and generalization dynamics of real diffusion models under consistency distillation.

What would settle it

An experiment in which a student model after consistency distillation shows memorization rates as high as the teacher model, or in which sample quality drops, would falsify the central claim.
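
A minimal sketch of how such a comparison could be scored. The page does not say how the memorization ratio is computed. The criterion below is an assumption: it flags a generated sample as memorized when its nearest training point is much closer than its second-nearest one. The threshold of 1/3 and the 2-D points are illustrative values, not results from the paper.

```python
import math

def is_memorized(sample, train_set, ratio_threshold=1/3):
    """Flag a sample whose nearest training point is much closer than the
    second-nearest one (assumed criterion; the threshold is a free choice)."""
    dists = sorted(math.dist(sample, x) for x in train_set)
    nearest, second = dists[0], dists[1]
    if second == 0.0:
        return True  # sample coincides with duplicated training points
    return nearest / second < ratio_threshold

def memorization_ratio(samples, train_set, ratio_threshold=1/3):
    """Fraction of generated samples flagged as memorized."""
    flags = [is_memorized(s, train_set, ratio_threshold) for s in samples]
    return sum(flags) / len(flags)

# Made-up toy points standing in for training data and model samples.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
teacher_samples = [(0.01, 0.0), (0.99, 0.01), (0.5, 0.5), (0.0, 0.98)]
student_samples = [(0.4, 0.3), (0.6, 0.5), (0.02, 0.0), (0.3, 0.6)]

print(memorization_ratio(teacher_samples, train))
print(memorization_ratio(student_samples, train))
```

Under this scoring, the central claim would be contradicted if the student's ratio were not lower than the teacher's on samples drawn from real models.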

Figures

Figures reproduced from arXiv: 2604.23552 by Bingqing Jiang, Difan Zou.

Figure 1: Student dynamics during consistency distillation on CIFAR-10. Left: memorization ratio. Right: FID. Consistency distillation reliably reduces memorization, while FID improvement depends on the memorization state of the teacher. … that of the teacher. In contrast, when the teacher is already in a severely memorization-dominated regime, the student still exhibits much weaker memorization, but its FID remains …
Figure 2: Spectral density of the teacher and consistency distillation curvature operators. (a) The teacher curvature operator $U$ exhibits a separated spectrum, with low-eigenvalue modes associated with memorization and high-eigenvalue modes associated with generalization. (b) The consistency distillation curvature $U_{\mathrm{cd}}$ shows sharp spectral atoms: a dominant spike at $\lambda = \beta$ induced by the isotropic shift acting on the …
Figure 3: Mode-wise spectral effects of non-isotropic consistency distillation. Each point corresponds to a teacher eigenmode $u_i$, plotted against its curvature eigenvalue $\lambda_i(U)$, with modes partitioned into memorization-associated (Mem) and generalization-associated (Gen) subspaces. (a) The visibility $a_i = u_i^\top S u_i$ is uniformly small for Mem modes and significantly larger for Gen modes. (b) The resolvent term $b_i$ = …
Figure 4: Qualitative comparison between progressive distillation and consistency distillation under different teacher memorization levels trained on 6000 data points. PD samples exhibit noticeably degraded visual fidelity compared with CD, especially under limited data and higher teacher memorization. … A Rationale for Adopting Consistency Distillation. A.1 Motivation for Consistency Distillation. To motivate our choi…
Figure 5: Comparison between continuous-time and discrete consistency distillation under limited data. Continuous-time CD exhibits blurred or failed generations, while the discrete objective remains stable under the same training budget. … A.2 Motivation for the Discrete Formulation of Consistency Distillation. The original formulation of consistency distillation admits a continuous-time extension, obtained as the infi…
Figure 6: Gen-to-Mem leakage after inverse-curvature weighting. Median fracBmem on Gen (Eq. (B.15)) versus ridge $\gamma$. We choose $\gamma^\star$ by the minimal-sufficient rule in Eq. (B.17) with tolerance $\tau = 0.1$. Importantly, ridge regularization does not alter the qualitative structure of the PF-ODE operator; it only controls the conditioning of the inverse curvature, which is unavoidable in consistency distillation. Rather than…
Figure 7: Effect of teacher model capacity on memorization and consistency distillation. Each panel reports the evolution of FID (left axis) and memorization ratio (right axis) as a function of training steps for the teacher diffusion model (EDM) and the corresponding consistency-distilled (CD) student. Model capacity is varied along three axes: network width (channels) and depth (number of ResNet blocks). In all ca…
Figure 8: Effect of student–teacher architectural mismatch on memorization and sample quality under two-step consistency distillation. The figure reports FID (left y-axis, solid lines) and memorization ratio (right y-axis, dashed lines) as functions of training steps for different student–teacher architectural configurations. Results are shown for both 10%-memorization and 30%-memorization teachers. Across all setti…
Figure 9: Effect of the $\psi_p/\psi_n$ ratio on mode-wise non-isotropic CD response. Each point corresponds to a teacher eigenmode $u_i$, plotted against its curvature eigenvalue $\lambda_i(U)$, with modes partitioned into MEM and GEN by a fixed spectral threshold. Across panels, the feature dimension is fixed at $\psi_p = 32$, while the effective sample ratio $\psi_n$ decreases from left to right. As $\psi_p/\psi_n$ decreases, the fraction of GEN modes wit…
Figure 10: Spectral separation of the teacher curvature matrix $U$ under different $\psi_p/\psi_n$ ratios. Shown are the empirical spectral densities $\rho(\lambda)$ of $U$, with modes partitioned into memorization-associated (MEM) and generalization-associated (GEN) subspaces by a fixed threshold. For $\psi_p = 32$ and $\psi_n = 16$ (left), the GEN spectrum is well separated and concentrated at significantly larger eigenvalues than the MEM spectrum, …
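
The caption of Figure 6 states that ridge regularization only controls the conditioning of the inverse curvature. The following is a minimal sketch of that conditioning effect. It assumes the ridge enters as (U + γI)^{-1}, so that an eigenmode with eigenvalue λ_i is weighted by 1/(λ_i + γ); this form is not stated on this page. The spectrum below is hypothetical, with small eigenvalues standing in for Mem modes and large ones for Gen modes, following the convention in Figure 2.

```python
def inverse_curvature_weights(eigenvalues, gamma):
    """Per-mode weights 1 / (lambda_i + gamma) of an assumed ridge-regularized inverse."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return [1.0 / (lam + gamma) for lam in eigenvalues]

def condition_number(eigenvalues, gamma):
    """Ratio of the largest to the smallest shifted eigenvalue."""
    shifted = [lam + gamma for lam in eigenvalues]
    return max(shifted) / min(shifted)

# Hypothetical spectrum: two small (Mem-like) and two large (Gen-like) eigenvalues.
spectrum = [1e-4, 1e-3, 1.0, 2.0]

for gamma in (0.0, 0.1, 1.0):
    weights = inverse_curvature_weights(spectrum, gamma)
    # Largest weight is bounded by 1 / gamma whenever gamma > 0.
    print(gamma, max(weights), condition_number(spectrum, gamma))
```

Without the ridge, the smallest eigenvalues dominate the inverse. With any γ > 0, no mode can be amplified by more than 1/γ, which is the sense in which the ridge controls conditioning in this sketch.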
Original abstract

Diffusion models are central to modern generative modeling, and understanding how they balance memorization and generalization is critical for reliable deployment. Recent work has shown that memorization in diffusion models is shaped by training dynamics, with generalization and memorization emerging at different stages of training. However, deployed diffusion models are often further distilled, introducing an additional training phase whose impact on memorization is not well understood. In this work, we analyze how distillation reshapes memorization behavior in diffusion models, taking consistency distillation as a representative framework. Empirically, we show that when applied to a teacher model that has memorized data, consistency distillation significantly reduces transferred memorization in the student while preserving, and sometimes improving, sample quality. To explain this behavior, we provide a theoretical analysis using a random feature neural network model [Bonnaire et al., 2025], showing that consistency distillation suppresses unstable feature directions associated with memorization while preserving stable, generalizable modes. Our findings suggest that distillation can serve not only as an acceleration tool, but also as a mechanism for improving the memorization-generalization trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper analyzes the effect of consistency distillation on memorization in diffusion models. Empirically, it demonstrates that applying consistency distillation to a teacher model that has memorized training data produces a student model with significantly reduced transferred memorization while preserving or improving sample quality. Theoretically, it uses a random feature neural network model from prior work to argue that consistency distillation suppresses unstable feature directions associated with memorization while retaining stable, generalizable modes, thereby improving the memorization-generalization tradeoff.

Significance. If the empirical results are robust across models and datasets and the random feature analysis maps meaningfully to real diffusion dynamics, the work would show that distillation can serve as a regularization mechanism beyond acceleration. This has potential value for privacy-sensitive applications of generative models. The mechanistic explanation via feature stability is a conceptual contribution, though its applicability depends on the validity of the modeling assumptions.

major comments (1)
  1. The theoretical explanation in the section on random feature analysis relies on the external model from Bonnaire et al. 2025 to claim that consistency distillation suppresses unstable directions linked to memorization. However, the random feature setup lacks the iterative denoising trajectory, noise scheduling, and U-Net inductive biases of actual diffusion models, so it is not immediate that the identified unstable features correspond to memorization in the score function or sampling process of deployed models. Concrete justification or additional experiments bridging the simplified model to the empirical diffusion setting are needed for the explanation to support the central claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our analysis of consistency distillation and memorization in diffusion models. The major comment identifies a key limitation in the theoretical section, which we address below with a commitment to revisions that clarify the scope and strengthen the connection to empirical results.

Point-by-point responses
  1. Referee: The theoretical explanation in the section on random feature analysis relies on the external model from Bonnaire et al. 2025 to claim that consistency distillation suppresses unstable directions linked to memorization. However, the random feature setup lacks the iterative denoising trajectory, noise scheduling, and U-Net inductive biases of actual diffusion models, so it is not immediate that the identified unstable features correspond to memorization in the score function or sampling process of deployed models. Concrete justification or additional experiments bridging the simplified model to the empirical diffusion setting are needed for the explanation to support the central claim.

    Authors: We thank the referee for highlighting this important limitation. We agree that the random feature neural network model from Bonnaire et al. (2025) is a controlled abstraction that omits the iterative denoising trajectory, noise scheduling, and U-Net inductive biases present in real diffusion models. Consequently, while the analysis shows that consistency objectives can suppress unstable feature directions associated with memorization in this simplified setting, a direct correspondence to memorization in the score function or sampling process of deployed diffusion models is not automatic. The empirical results in the paper (reduced transferred memorization with preserved or improved sample quality) stand independently and motivate the need for an explanatory mechanism. To address the concern, we will revise the theoretical section to explicitly discuss the modeling assumptions, state the limited scope of the claims, and provide concrete justification by linking the feature-stability insight to observed behaviors in our diffusion experiments (e.g., how consistency distillation regularizes high-frequency or unstable modes). We will also add a dedicated limitations paragraph and, where possible, include supplementary analysis of feature directions in the actual distilled models to better bridge the simplified theory to the empirical diffusion setting.

    Revision: partial

Circularity Check

0 steps flagged

No circularity; empirical results and external theoretical model are independent

The paper presents empirical observations on memorization reduction under consistency distillation as a separate contribution from its theoretical explanation. The theory explicitly invokes an external random feature neural network model from Bonnaire et al. 2025 rather than deriving the suppression of unstable directions from the paper's own data, fits, or self-citations. No load-bearing step reduces a claimed prediction or uniqueness result to a fitted parameter or prior self-citation by construction. The derivation chain remains self-contained against the cited external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the random feature model serving as a faithful proxy for diffusion model dynamics and on the empirical observations being representative.

axioms (1)
  • Domain assumption: the random feature neural network model from Bonnaire et al. (2025) captures key memorization dynamics in diffusion models.
    Invoked to provide the theoretical explanation for the observed empirical behavior.

pith-pipeline@v0.9.0 · 5488 in / 1085 out tokens · 46485 ms · 2026-05-08T06:31:45.479889+00:00 · methodology

