pith. machine review for the scientific record.

arxiv: 2605.02908 · v1 · submitted 2026-04-06 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links


Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 20:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords memorization · stable diffusion · CLIP embeddings · text-to-image generation · diffusion models · token embeddings · inference-time mitigation · image memorization

The pith

Stable Diffusion memorizes training images because CLIP padding embeddings duplicate the end-of-text embedding and amplify its influence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how different parts of the text prompt embeddings affect whether Stable Diffusion generates memorized images from training data. It finds that the actual prompt tokens play a small role in memorized outputs, while the padding tokens have a large effect. This happens because the padding embeddings copy the structure of the end-of-text embedding, which is the only one directly trained in CLIP. As a result, the model over-focuses on this single embedding and reproduces exact training images. The authors show two simple changes at generation time that reduce this effect while keeping image quality intact.

Core claim

In memorized cases, the prompt embeddings contribute minimally to the generated image, whereas the padding embeddings strongly drive memorization because they structurally duplicate the end-of-text embedding. Since the end-of-text embedding is the only one explicitly optimized during CLIP training, this duplication unintentionally increases its influence, causing the diffusion model to over-rely on it and produce memorized outputs.

What carries the argument

The structural duplication of the end-of-text embedding by the padding embeddings in CLIP token sequences fed to Stable Diffusion.
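
This duplication is checkable outside the paper. A minimal sketch, assuming the Hugging Face transformers CLIP tokenizer for openai/clip-vit-large-patch14 (the text encoder Stable Diffusion v1 uses); the checkpoint name and the 77-token context are assumptions about the standard setup, not details taken from the paper:

    from transformers import CLIPTokenizer

    # Sketch: the default <pad> token of this tokenizer is literally
    # <|endoftext|>, so every padding slot reuses the eot token id.
    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    assert tok.pad_token_id == tok.eos_token_id  # <pad> duplicates <eot>

    enc = tok("a photo of a cat", padding="max_length", max_length=77)
    ids = enc["input_ids"]
    # Layout: <sot>, prompt tokens, <eot>, then <pad> (= eot id) up to 77.
    n_prompt = len(tok("a photo of a cat", add_special_tokens=False)["input_ids"])
    assert ids.count(tok.eos_token_id) == 77 - 1 - n_prompt  # <eot> plus all pads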

If this is right

  • Replacing the tokenizer's default padding token with the exclamation mark token and masking the end-of-text embedding reduces memorization.
  • Partial masking of the padding embeddings also suppresses memorization at inference time.
  • Both methods work without needing to detect memorized cases in advance and preserve generation quality.
  • These interventions are simple and can be applied directly during image generation; a hedged sketch of both follows this list.
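
A minimal sketch of how these could be wired up at inference time with diffusers, assuming the Stable Diffusion v1.4 checkpoint and zeroing as the masking operator; the checkpoint name, the prompt_embeds route, and zero-masking are assumptions, not the authors' released code:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
    tok, text_encoder = pipe.tokenizer, pipe.text_encoder

    prompt = "a photo of a cat"
    ids = tok(prompt, truncation=True, max_length=77)["input_ids"]
    eot_pos = len(ids) - 1                              # position of <eot>
    bang_id = tok("!", add_special_tokens=False)["input_ids"][0]
    ids = ids + [bang_id] * (77 - len(ids))             # pad with '!' instead of <eot>

    with torch.no_grad():
        embeds = text_encoder(torch.tensor([ids]))[0]   # (1, 77, 768)
    embeds[:, eot_pos, :] = 0.0                         # mask v^eot (here: zeroing)

    # Mitigation (2) would instead zero ~70% of the pad positions nearest
    # <eot> and leave v^eot itself untouched (cf. Figure 7).
    image = pipe(prompt_embeds=embeds).images[0]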

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar embedding duplication effects might occur in other text-to-image models that use CLIP or similar text encoders.
  • Retraining CLIP with more distinct padding embeddings could prevent this memorization pathway.
  • Monitoring the influence of specific token embeddings during generation could help identify memorization risks earlier; a minimal monitoring sketch follows this list.
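
On the last point, the "Bright Ending" signature (Figure 5) suggests one concrete signal: the share of cross-attention mass landing on the <eot>/<pad> tail. A hypothetical helper, assuming access to the softmaxed cross-attention weights that a custom diffusers attention processor could expose; the function name and tensor shape are illustrative:

    import torch

    def tail_attention_mass(attn_probs: torch.Tensor, eot_pos: int) -> float:
        """Fraction of cross-attention mass on <eot> plus the padding tail.

        attn_probs: (batch*heads, image_queries, 77) softmaxed weights over
        text tokens; eot_pos: index of <eot> in the unpadded prompt. Values
        far above the uniform baseline would flag a 'Bright Ending'-style
        spike and, on the paper's reading, elevated memorization risk.
        """
        return attn_probs[..., eot_pos:].sum(dim=-1).mean().item()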

Load-bearing premise

The correlation between padding embeddings duplicating the end-of-text embedding and higher memorization rates is caused by that duplication rather than by some other property of the padding tokens.

What would settle it

If applying the proposed masking of padding or end-of-text embeddings fails to reduce the frequency of memorized image outputs in Stable Diffusion experiments.

Figures

Figures reproduced from arXiv: 2605.02908 by Albert No, Bumjun Kim.

Figure 1
Figure 1: Mitigating memorization via <pad> replacement and v^eot masking. We identify <pad> embeddings (v^pad) as a major contributor to memorization, as they are implicit duplications of the <endoftext> embedding (v^eot). To mitigate this, we replace the tokenizer’s default <pad> (<eot>) with the ! token, and mask v^eot. This method effectively suppresses memorization while preserving image quality and prompt alig… view at source ↗
Figure 2
Figure 2: A text prompt is first processed by the tokenizer into … view at source ↗
Figure 3
Figure 3: v^pr_i are not important. (a) shows that v^pr_i alone cannot guide generation, as the resulting images collapse when they are the only embeddings retained. In contrast, (b) replacing v^pr_i with v^eot or (c) masking v^pr_i preserves most of the original structure. These results reinforce our earlier observation that token-level influence does not persist after CLIP embedding. (Metrics table truncated: Method / SSCD / CLIPScore / Aesthetic; (a) …) view at source ↗
Figure 4
Figure 4: v^pad_i are important. (d) to (g) show cases where v^pad_i are preserved: replacing v^pad_i with v^eot, retaining only v^eot, masking v^eot, or retaining only v^pad_i. All four preserve the structure and semantics of the original image, whereas (h) masking v^pad_i generates images that deviate significantly from it. (Metrics table truncated: Method / SSCD / CLIPScore / Aesthetic; (d) 0.85 ± 0.07 / 0.32 ± 0.01 / 5.28 ± 0.07; (e) 0.49…) view at source ↗
Figure 5
Figure 5: While Chen et al. [4] reported a “Bright Ending”, an attention spike on <eot> (specifically on v^eot) when memorization occurs, we observe that this effect is not limited to v^eot. Multiple <pad> (v^pad) adjacent to <eot> also exhibit elevated attention, indicating that they play a substantial role in the memorization process rather than serving as placeholders. (Body text spill, truncated: …assumption that v^pad_i are semantically irrel…) view at source ↗
Figure 6
Figure 6: Comparison of mitigation results across five methods using the same prompt and seed. Our method (second column) mitigates memorization while preserving structure and prompt alignment. Ren et al. [24] reduce memorization but introduce distortion, while Wen et al. [36], RTA and RNA [31] often fail to mitigate memorization. Additional examples are in Appendix A.2. (Metrics table truncated: Category / Method / SSCD / CLIPScore / Aesthetic / LPIP…) view at source ↗
Figure 7
Figure 7: Partial masking of v^pad_i to mitigate memorization. (b) and (c) show generations where 70% and 100% of the v^pad_i adjacent to v^eot are masked, respectively. Masking 70% empirically strikes a well-balanced tradeoff, substantially reducing memorization with negligible quality degradation, whereas full masking (100%) can often cause collapse or noticeable degradation. (Body text spill, truncated: …ically, we find that masking 70% of v^pad…) view at source ↗
Figure 8
Figure 8: Using our mitigation, which replaces … view at source ↗
Figure 9
Figure 9: <pad> replacement and v^eot masking. The first column (Original) shows the memorized image consistently reproduced from the original embedding regardless of seeds. The remaining five columns (Ours) are generated using our mitigation method with five different random seeds. (Body text spill, truncated: A.2. Comparison with Prior Mitigation Methods … (1) Ren et al. [24], which rescales cross-attention. (2) Wen…) view at source ↗
Figure 11
Figure 11: (a) Original generation without swapping. (b) Genera… view at source ↗
Figure 12
Figure 12: “Original” is generated using the default embeddings, while “Ours” applies our mitigation strategy (… view at source ↗
Figure 13
Figure 13: Each left column shows the reference image, and each … view at source ↗
Figure 14
Figure 14: Text embeddings of “Living in the Light with Ann Graham” … view at source ↗
Figure 15
Figure 15: Cross-attention maps of “Mothers influence on her young hippo” … view at source ↗
Figure 16
Figure 16: Cross-attention maps of “Living in the Light with Ann Graham” … view at source ↗
read the original abstract

Understanding how textual embeddings contribute to memorization in text-to-image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model disproportionately relies on specific embeddings. We categorize input tokens as <startoftext>, <prompt>, <endoftext> and <pad> with corresponding embeddings $\mathbf{v}^{\mathbf{sot}}, \mathbf{v}^{\mathbf{pr}}, \mathbf{v}^{\mathbf{eot}}, \mathbf{v}^{\mathbf{pad}}$. We discover that $\mathbf{v}^{\mathbf{pr}}$ contribute minimally to generation in memorized cases. In contrast, $\mathbf{v}^{\mathbf{pad}}$ strongly affect memorization due to their structural duplication of $\mathbf{v}^{\mathbf{eot}}$, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of $\mathbf{v}^{\mathbf{eot}}$, causing the model to over-rely on it, thereby driving memorization. Based on these observations, we propose two simple yet effective inference-time mitigation strategies: (1) Replacing the tokenizer's default <pad> from <eot> to the ! token before embedding, and masking the $\mathbf{v}^{\mathbf{eot}}$; (2) Partial masking of $\mathbf{v}^{\mathbf{pad}}$. Both suppress memorization without degrading quality, and are readily deployable without prior detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that memorization in Stable Diffusion is driven by CLIP embeddings, specifically that prompt embeddings v^pr contribute minimally while pad embeddings v^pad strongly influence memorization by structurally duplicating the eot embedding v^eot (the only one explicitly optimized in CLIP training). This duplication is said to amplify v^eot influence and cause over-reliance leading to memorization. The authors categorize tokens as sot, pr, eot, pad and propose two inference-time mitigations—replacing default pad with ! token plus eot masking, or partial v^pad masking—that suppress memorization without degrading generation quality.

Significance. If the causal attribution holds after proper controls, the work offers a concrete mechanistic explanation for a subset of memorization cases and supplies immediately deployable inference-time fixes. This would be useful for interpretability and safety in text-to-image models. The observational categorization of embedding influence is a useful starting point, but the significance is limited by the absence of quantitative effect sizes and isolating ablations.

major comments (2)
  1. [Abstract] Abstract and mitigation descriptions: the central claim that v^pad duplication of v^eot drives memorization rests on post-hoc token categorization and the two proposed interventions. Neither intervention includes a control that preserves identical padding length and non-prompt token count while substituting a non-eot, non-! embedding for v^pad; therefore the observed memorization reduction cannot be isolated from generic effects of altered sequence length, attention distribution, or effective prompt length.
  2. [Abstract] Abstract: the abstract asserts that the mitigations 'suppress memorization without degrading quality' yet provides no quantitative metrics, error bars, ablation tables, or statistical tests comparing memorization rates and image quality (e.g., FID, CLIP score) before and after intervention. This absence makes it impossible to evaluate the magnitude of the effect or the quality–memorization trade-off.
minor comments (1)
  1. Notation for embeddings (v^sot, v^pr, v^eot, v^pad) should be defined once in a dedicated notation section or table and used consistently; currently the abstract introduces them without subsequent formal definitions or equations. A sketch of such a table follows.
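
One way the requested notation block could look, sketched in LaTeX; the index ranges assume CLIP's 77-token context window and are an inference from the abstract, not the authors' wording:

    \begin{table}[t]
      \centering
      \begin{tabular}{ll}
        \toprule
        Symbol & Meaning \\
        \midrule
        $\mathbf{v}^{\mathrm{sot}}$   & embedding of \texttt{<startoftext>} \\
        $\mathbf{v}^{\mathrm{pr}}_i$  & prompt-token embeddings, $i = 1, \dots, n$ \\
        $\mathbf{v}^{\mathrm{eot}}$   & embedding of \texttt{<endoftext>} \\
        $\mathbf{v}^{\mathrm{pad}}_i$ & padding embeddings, $i = 1, \dots, 77 - n - 2$ \\
        \bottomrule
      \end{tabular}
      \caption{Token-embedding notation.}
    \end{table}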

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify the need for stronger causal isolation and quantitative validation of our claims. We address each major comment below and commit to incorporating the suggested controls and metrics in a revised version of the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and mitigation descriptions: the central claim that v^pad duplication of v^eot drives memorization rests on post-hoc token categorization and the two proposed interventions. Neither intervention includes a control that preserves identical padding length and non-prompt token count while substituting a non-eot, non-! embedding for v^pad; therefore the observed memorization reduction cannot be isolated from generic effects of altered sequence length, attention distribution, or effective prompt length.

    Authors: We agree this is a substantive limitation in the current experiments. The proposed mitigations alter the input in ways that could affect attention patterns beyond the specific eot-duplication mechanism. In the revision we will add an explicit control condition that keeps padding length and total non-prompt token count identical but substitutes pad embeddings with those of a neutral, non-special token (e.g., a randomly chosen common vocabulary embedding unrelated to eot or '!'). This will allow direct comparison of memorization rates under matched sequence statistics, thereby isolating the contribution of the eot duplication. revision: yes
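
A sketch of what that control could look like, with 'the' standing in as the neutral vocabulary token; the token choice and the comparison protocol are assumptions about the promised revision, not existing experiments:

    import torch
    from transformers import CLIPTextModel, CLIPTokenizer

    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    ids = tok("a photo of a cat", truncation=True, max_length=77)["input_ids"]
    neutral_id = tok("the", add_special_tokens=False)["input_ids"][0]
    # Same sequence length and non-prompt token count as the default
    # condition; only the identity of the padding embedding changes.
    ctrl_ids = ids + [neutral_id] * (77 - len(ids))

    with torch.no_grad():
        ctrl_embeds = enc(torch.tensor([ctrl_ids]))[0]
    # Feed ctrl_embeds to the pipeline via prompt_embeds and compare
    # memorization rates against the '!'-padded and default conditions.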

  2. Referee: [Abstract] Abstract: the abstract asserts that the mitigations 'suppress memorization without degrading quality' yet provides no quantitative metrics, error bars, ablation tables, or statistical tests comparing memorization rates and image quality (e.g., FID, CLIP score) before and after intervention. This absence makes it impossible to evaluate the magnitude of the effect or the quality–memorization trade-off.

    Authors: We acknowledge that the abstract currently states the quality-preserving property without supporting numbers. The manuscript relies primarily on qualitative examples. We will revise the abstract and add a dedicated experimental section containing: (i) memorization rates (exact-match and embedding-similarity thresholds) on a fixed test set of 500 prompts, (ii) FID and CLIP-score distributions with error bars across multiple random seeds, and (iii) statistical tests (e.g., paired t-tests) comparing pre- and post-intervention conditions. These results will be presented in tables and will quantify the observed trade-off. revision: yes
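
For point (iii), the pre/post comparison reduces to a paired test over per-prompt scores. A sketch with synthetic stand-in numbers; the real inputs would be SSCD similarities on the 500-prompt set described above:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Synthetic stand-ins for per-prompt SSCD similarity to the nearest
    # training image, before and after the intervention.
    before = rng.uniform(0.4, 0.9, size=500)
    after = np.clip(before - rng.uniform(0.1, 0.4, size=500), 0.0, 1.0)

    t, p = stats.ttest_rel(before, after)  # paired t-test across prompts
    print(f"mean drop = {(before - after).mean():.3f}, t = {t:.2f}, p = {p:.3g}")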

Circularity Check

0 steps flagged

No circularity: empirical measurements of embedding influence stand independently

full rationale

The paper presents its core findings as direct empirical observations from experiments on token embeddings in Stable Diffusion: v^pr contribute minimally while v^pad influence memorization via structural duplication of v^eot. These are framed as discoveries from intervention-based measurements rather than as a derivation, prediction, or first-principles result that reduces, via the paper's own equations, to fitted inputs or prior self-citations. The proposed mitigations (pad replacement with ! plus eot masking, or partial pad masking) follow from the observations without invoking uniqueness theorems, ansatzes smuggled via citation, or renaming of known results. No load-bearing step equates the claimed causal mechanism to a self-referential fit or definition, leaving the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the assumption that CLIP training optimizes only the <eot> embedding and that the tokenizer's default padding reuses that vector; no free parameters are introduced, no new entities are postulated, and the axioms are standard embedding arithmetic.

axioms (2)
  • domain assumption CLIP training optimizes only the <eot> embedding vector
    Stated in the abstract as background for why duplication amplifies influence
  • domain assumption Token embeddings are added or attended to independently during diffusion generation
    Implicit in the categorization of v^sot, v^pr, v^eot, v^pad contributions

pith-pipeline@v0.9.0 · 5548 in / 1390 out tokens · 64439 ms · 2026-05-10T20:03:53.449041+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

  1. [1]

    Flux. https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

  2. [2]

    Extracting training data from diffusion models

    Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In USENIX Security, 2023.

  3. [3]

    Towards memorization-free diffusion models

    Chen Chen, Daochang Liu, and Chang Xu. Towards memorization-free diffusion models. In CVPR, 2024.

  4. [4]

    Exploring local memorization in diffusion models via bright ending attention

    Chen Chen, Daochang Liu, Mubarak Shah, and Chang Xu. Exploring local memorization in diffusion models via bright ending attention. In ICLR, 2025.

  5. [5]

    Reproducible scaling laws for contrastive language-image learning

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In CVPR, 2023.

  6. [6]

    Consistent diffusion meets tweedie: Training exact ambient diffusion models with noisy data

    Giannis Daras, Alex Dimakis, and Constantinos Costis Daskalakis. Consistent diffusion meets tweedie: Training exact ambient diffusion models with noisy data. In ICML, 2024.

  7. [7]

    Ambient diffusion: Learning clean distributions from corrupted data

    Giannis Daras, Kulin Shah, Yuval Dagan, Aravind Gollakota, Alex Dimakis, and Adam Klivans. Ambient diffusion: Learning clean distributions from corrupted data. In NeurIPS, 2024.

  8. [8]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

  9. [9]

    On memorization in diffusion models

    Xiangming Gu, Chao Du, Tianyu Pang, Chongxuan Li, Min Lin, and Ye Wang. On memorization in diffusion models. TMLR, 2025.

  10. [10]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In EMNLP, 2021.

  11. [11]

    Finding nemo: Localizing neurons responsible for memorization in diffusion models

    Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, and Franziska Boenisch. Finding nemo: Localizing neurons responsible for memorization in diffusion models. In NeurIPS, 2024.

  12. [12]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

  13. [13]

    Membench: Memorized image trigger prompt dataset for diffusion models

    Chunsan Hong, Tae-Hyun Oh, and Minhyuk Sung. Membench: Memorized image trigger prompt dataset for diffusion models. TMLR, 2025.

  14. [14]

    Understanding and mitigating memorization in generative models via sharpness of probability landscapes

    Dongjae Jeon, Dueun Kim, and Albert No. Understanding and mitigating memorization in generative models via sharpness of probability landscapes. In ICML, 2025.

  15. [15]

    Ai art and its impact on artists

    Harry H Jiang, Lauren Brown, Jessica Cheng, Mehtab Khan, Abhishek Gupta, Deja Workman, Alex Hanna, Johnathan Flowers, and Timnit Gebru. Ai art and its impact on artists. In AIES, 2023.

  16. [16]

    Image-level memorization detection via inversion-based inference perturbation

    Yue Jiang, Haokun Lin, Yang Bai, Bo Peng, Zhili Liu, Yueming Lyu, Yong Yang, Jing Dong, et al. Image-level memorization detection via inversion-based inference perturbation. In ICLR, 2025.

  17. [17]

    How diffusion models memorize

    Juyeop Kim, Songkuk Kim, and Jong-Seok Lee. How diffusion models memorize. arXiv preprint arXiv:2509.25705, 2025.

  18. [18]

    Finding dori: Memorization in text-to-image diffusion models is not local

    Antoni Kowalczuk, Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, and Franziska Boenisch. Finding dori: Memorization in text-to-image diffusion models is not local. arXiv preprint arXiv:2507.16880, 2025.

  19. [19]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

  20. [20]

    W. H. Orrick. Andersen v. Stability AI Ltd., 2023. https://casetext.com/case/andersen-v-stability-ai-ltd.

  21. [21]

    A self-supervised descriptor for image copy detection

    Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. In CVPR, 2022.

  22. [22]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.

  23. [23]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  24. [24]

    Unveiling and mitigating memorization in text-to-image diffusion models through cross attention

    Jie Ren, Yaxin Li, Shenglai Zeng, Han Xu, Lingjuan Lyu, Yue Xing, and Jiliang Tang. Unveiling and mitigating memorization in text-to-image diffusion models through cross attention. In ECCV, 2024.

  25. [25]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

  26. [26]

    A geometric framework for understanding memorization in generative models

    Brendan Leigh Ross, Hamidreza Kamkari, Zhaoyan Liu, Tongzi Wu, George Stein, Gabriel Loaiza-Ganem, and Jesse C Cresswell. A geometric framework for understanding memorization in generative models. In ICML 2024 Workshop on GRaM, 2024.

  27. [27]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.

  28. [28]

    Gustavosta/Stable-Diffusion-Prompts · Datasets at Hugging Face, 2022

    Gustavo Santana. Gustavosta/Stable-Diffusion-Prompts · Datasets at Hugging Face, 2022.

  29. [29]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.

  30. [30]

    Diffusion art or digital forgery? investigating data replication in diffusion models

    Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Diffusion art or digital forgery? investigating data replication in diffusion models. In CVPR, 2023.

  31. [31]

    Understanding and mitigating copying in diffusion models

    Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Understanding and mitigating copying in diffusion models. In NeurIPS, 2023.

  32. [32]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.

  33. [33]

    Maximum likelihood training of score-based diffusion models

    Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In NeurIPS, 2021.

  34. [34]

    Padding tone: A mechanistic analysis of padding tokens in t2i models

    Michael Toker, Ido Galil, Hadas Orgad, Rinon Gal, Yoad Tewel, Gal Chechik, and Yonatan Belinkov. Padding tone: A mechanistic analysis of padding tokens in t2i models. In NAACL, 2025.

  35. [35]

    A reproducible extraction of training images from diffusion models

    Ryan Webster. A reproducible extraction of training images from diffusion models. arXiv preprint arXiv:2305.08694, 2023.

  36. [36]

    Detecting, explaining, and mitigating memorization in diffusion models

    Yuxin Wen, Yuchen Liu, Chen Chen, and Lingjuan Lyu. Detecting, explaining, and mitigating memorization in diffusion models. In ICLR, 2024.

  37. [37]

    Towards understanding the working mechanism of text-to-image diffusion model

    Mingyang Yi, Aoxue Li, Yi Xin, and Zhenguo Li. Towards understanding the working mechanism of text-to-image diffusion model. In NeurIPS, 2024.

  38. [38]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.