Recognition: 2 Lean theorem links
Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings
Pith reviewed 2026-05-10 20:03 UTC · model grok-4.3
The pith
Stable Diffusion memorizes training images because CLIP padding embeddings duplicate the end-of-text embedding and amplify its influence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In memorized cases, the prompt embeddings contribute minimally to the generated image, whereas the padding embeddings strongly drive memorization because they structurally duplicate the end-of-text embedding. Since the end-of-text embedding is the only one explicitly optimized during CLIP training, this duplication unintentionally increases its influence, causing the diffusion model to over-rely on it and produce memorized outputs.
What carries the argument
The structural duplication of the end-of-text embedding by the padding embeddings in CLIP token sequences fed to Stable Diffusion.
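The duplication can be made concrete with a toy sketch (a hypothetical illustration, not the paper's code: the vocabulary size, embedding dimension, and token ids are invented, though the 77-token sequence length matches CLIP). Because the pad token resolves to the same id as end-of-text, every padding position receives an exact copy of v^eot:

```python
import numpy as np

VOCAB, DIM, SEQ_LEN = 1000, 8, 77   # toy sizes; real CLIP uses a 49408 x 768 table
SOT_ID, EOT_ID = 0, 999
PAD_ID = EOT_ID                     # the structural duplication at issue

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(VOCAB, DIM))

def embed(token_ids):
    """Pad a token-id list to SEQ_LEN with PAD_ID, then look up embeddings."""
    ids = token_ids + [PAD_ID] * (SEQ_LEN - len(token_ids))
    return embedding_table[np.array(ids)]

seq = embed([SOT_ID, 5, 42, EOT_ID])   # <sot>, two prompt tokens, <eot>
pad_rows = seq[4:]                     # the 73 padding positions
# every padding row is an exact copy of v^eot
assert np.allclose(pad_rows, embedding_table[EOT_ID])
```

Since 73 of the 77 rows are copies of the single explicitly trained v^eot, any attention the diffusion model pays to padding is, structurally, attention paid to v^eot.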
If this is right
- Replacing the tokenizer's default padding token with the exclamation mark token and masking the end-of-text embedding reduces memorization.
- Partial masking of the padding embeddings also suppresses memorization at inference time.
- Both methods work without needing to detect memorized cases in advance and preserve generation quality.
- These interventions are simple and can be applied directly during image generation.
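A rough sketch of how such masks could be built at inference time (assumptions: the encoder output is a 77 x d array, "masking" means zeroing rows, and the 50% partial-mask ratio is an invented default; mitigation 1 additionally requires retokenizing with `!` as the pad token before encoding, which is not shown here):

```python
import numpy as np

def mask_eot(embeddings, eot_pos):
    """Mitigation 1 (masking step): zero the <eot> embedding row."""
    out = embeddings.copy()
    out[eot_pos] = 0.0
    return out

def partial_pad_mask(embeddings, eot_pos, mask_ratio=0.5):
    """Mitigation 2: zero a fraction of the padding rows after <eot>."""
    out = embeddings.copy()
    pad_idx = np.arange(eot_pos + 1, len(embeddings))
    n_mask = int(len(pad_idx) * mask_ratio)
    out[pad_idx[:n_mask]] = 0.0
    return out

enc = np.random.default_rng(1).normal(size=(77, 8))  # toy encoder output
masked = partial_pad_mask(enc, eot_pos=4)            # zeros half of rows 5..76
```

Both functions operate only on the encoder output, which is why the interventions need no retraining and no prior detection of memorized prompts.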
Where Pith is reading between the lines
- Similar embedding duplication effects might occur in other text-to-image models that use CLIP or similar text encoders.
- Retraining CLIP with more distinct padding embeddings could prevent this memorization pathway.
- Monitoring the influence of specific token embeddings during generation could help identify memorization risks earlier.
Load-bearing premise
The correlation between padding embeddings duplicating the end-of-text embedding and higher memorization rates is caused by that duplication rather than by some other property of the padding tokens.
What would settle it
The mechanism would be falsified if applying the proposed masking of padding or end-of-text embeddings failed to reduce the frequency of memorized image outputs in Stable Diffusion experiments.
read the original abstract
Understanding how textual embeddings contribute to memorization in text-to-image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as <startoftext>, <prompt>, <endoftext> and <pad> with corresponding embeddings $\mathbf{v}^{\mathbf{sot}}, \mathbf{v}^{\mathbf{pr}}, \mathbf{v}^{\mathbf{eot}}, \mathbf{v}^{\mathbf{pad}}$. We discover that $\mathbf{v}^{\mathbf{pr}}$ contribute minimally to generation in memorized cases. In contrast, $\mathbf{v}^{\mathbf{pad}}$ strongly affect memorization due to their structural duplication of $\mathbf{v}^{\mathbf{eot}}$, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of $\mathbf{v}^{\mathbf{eot}}$, causing the model to over-rely on it, thereby driving memorization. Based on these observations, we propose two simple yet effective inference-time mitigation strategies: (1) Replacing the tokenizer's default <pad> from <eot> to the ! token before embedding, and masking the $\mathbf{v}^{\mathbf{eot}}$; (2) Partial masking of $\mathbf{v}^{\mathbf{pad}}$. Both suppress memorization without degrading quality, and are readily deployable without prior detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that memorization in Stable Diffusion is driven by CLIP embeddings, specifically that prompt embeddings v^pr contribute minimally while pad embeddings v^pad strongly influence memorization by structurally duplicating the eot embedding v^eot (the only one explicitly optimized in CLIP training). This duplication is said to amplify v^eot influence and cause over-reliance leading to memorization. The authors categorize tokens as sot, pr, eot, pad and propose two inference-time mitigations—replacing default pad with ! token plus eot masking, or partial v^pad masking—that suppress memorization without degrading generation quality.
Significance. If the causal attribution holds after proper controls, the work offers a concrete mechanistic explanation for a subset of memorization cases and supplies immediately deployable inference-time fixes. This would be useful for interpretability and safety in text-to-image models. The observational categorization of embedding influence is a useful starting point, but the significance is limited by the absence of quantitative effect sizes and isolating ablations.
major comments (2)
- [Abstract] Abstract and mitigation descriptions: the central claim that v^pad duplication of v^eot drives memorization rests on post-hoc token categorization and the two proposed interventions. Neither intervention includes a control that preserves identical padding length and non-prompt token count while substituting a non-eot, non-! embedding for v^pad; therefore the observed memorization reduction cannot be isolated from generic effects of altered sequence length, attention distribution, or effective prompt length.
- [Abstract] Abstract: the abstract asserts that the mitigations 'suppress memorization without degrading quality' yet provides no quantitative metrics, error bars, ablation tables, or statistical tests comparing memorization rates and image quality (e.g., FID, CLIP score) before and after intervention. This absence makes it impossible to evaluate the magnitude of the effect or the quality–memorization trade-off.
minor comments (1)
- Notation for embeddings (v^sot, v^pr, v^eot, v^pad) should be defined once in a dedicated notation section or table and used consistently; currently the abstract introduces them without subsequent formal definitions or equations.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help clarify the need for stronger causal isolation and quantitative validation of our claims. We address each major comment below and commit to incorporating the suggested controls and metrics in a revised version of the manuscript.
read point-by-point responses
- Referee: [Abstract] Abstract and mitigation descriptions: the central claim that v^pad duplication of v^eot drives memorization rests on post-hoc token categorization and the two proposed interventions. Neither intervention includes a control that preserves identical padding length and non-prompt token count while substituting a non-eot, non-! embedding for v^pad; therefore the observed memorization reduction cannot be isolated from generic effects of altered sequence length, attention distribution, or effective prompt length.
Authors: We agree this is a substantive limitation in the current experiments. The proposed mitigations alter the input in ways that could affect attention patterns beyond the specific eot-duplication mechanism. In the revision we will add an explicit control condition that keeps padding length and total non-prompt token count identical but substitutes pad embeddings with those of a neutral, non-special token (e.g., a randomly chosen common vocabulary embedding unrelated to eot or '!'). This will allow direct comparison of memorization rates under matched sequence statistics, thereby isolating the contribution of the eot duplication. revision: yes
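A control of the kind committed to above might be sketched as follows (hypothetical: NEUTRAL_ID stands for an arbitrary common-vocabulary token, and the sizes are toy values); the two conditions share identical sequence length and non-prompt token count, differing only in whether padding rows duplicate v^eot:

```python
import numpy as np

VOCAB, DIM, SEQ_LEN = 1000, 8, 77
EOT_ID, NEUTRAL_ID = 999, 123      # NEUTRAL_ID: neither <eot> nor '!'

table = np.random.default_rng(2).normal(size=(VOCAB, DIM))

def embed_with_pad(prompt_ids, pad_id):
    """Pad to SEQ_LEN with the given pad id, then look up embeddings."""
    ids = prompt_ids + [pad_id] * (SEQ_LEN - len(prompt_ids))
    return table[np.array(ids)]

baseline = embed_with_pad([0, 5, 42, EOT_ID], pad_id=EOT_ID)      # duplicates v^eot
control  = embed_with_pad([0, 5, 42, EOT_ID], pad_id=NEUTRAL_ID)  # matched stats, no duplication
```

Comparing memorization rates between `baseline` and `control` would isolate the eot-duplication mechanism from generic sequence-length and attention effects.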
- Referee: [Abstract] Abstract: the abstract asserts that the mitigations 'suppress memorization without degrading quality' yet provides no quantitative metrics, error bars, ablation tables, or statistical tests comparing memorization rates and image quality (e.g., FID, CLIP score) before and after intervention. This absence makes it impossible to evaluate the magnitude of the effect or the quality–memorization trade-off.
Authors: We acknowledge that the abstract currently states the quality-preserving property without supporting numbers. The manuscript relies primarily on qualitative examples. We will revise the abstract and add a dedicated experimental section containing: (i) memorization rates (exact-match and embedding-similarity thresholds) on a fixed test set of 500 prompts, (ii) FID and CLIP-score distributions with error bars across multiple random seeds, and (iii) statistical tests (e.g., paired t-tests) comparing pre- and post-intervention conditions. These results will be presented in tables and will quantify the observed trade-off. revision: yes
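The promised comparison could be computed along these lines (a pure-Python sketch; the per-prompt similarity scores below are invented, and the 0.5 threshold is an assumption rather than the paper's choice):

```python
import math

def memorization_rate(similarities, threshold=0.5):
    """Fraction of generations whose copy-detection similarity exceeds threshold."""
    return sum(s > threshold for s in similarities) / len(similarities)

def paired_t(before, after):
    """Paired t-statistic over per-prompt scores before/after an intervention."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# invented per-prompt similarity scores to the nearest training image
before = [0.9, 0.8, 0.7, 0.6, 0.9]
after  = [0.3, 0.4, 0.2, 0.5, 0.3]
drop = memorization_rate(before) - memorization_rate(after)
t_stat = paired_t(before, after)
```

In the actual evaluation the scores would come from a copy-detection embedding similarity over the 500-prompt test set, with FID and CLIP score reported alongside.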
Circularity Check
No circularity: empirical measurements of embedding influence stand independently
full rationale
The paper presents its core findings as direct empirical observations from experiments on token embeddings in Stable Diffusion: v^pr contribute minimally while v^pad influence memorization via structural duplication of v^eot. These are framed as discoveries from intervention-based measurements, not as a derivation, prediction, or first-principles result that would reduce, via the paper's own equations, to fitted inputs or prior self-citations. The proposed mitigations (pad replacement with ! plus eot masking, or partial pad masking) follow from the observations without invoking uniqueness theorems, ansatzes smuggled via citation, or renaming of known results. No load-bearing step equates the claimed causal mechanism to a self-referential fit or definition, leaving the analysis self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption CLIP training optimizes only the <eot> embedding vector
- domain assumption Token embeddings are added or attended to independently during diffusion generation
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem). Paper passage:
v^pad strongly affect memorization due to their structural duplication of v^eot, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of v^eot
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective
  Tag: unclear (relation between the paper passage and the cited Recognition theorem). Paper passage:
Replacing the tokenizer’s default <pad> from <eot> to the ! token before embedding, and masking the v^eot
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.