pith. sign in

arxiv: 2508.00756 · v4 · pith:JASBXISSnew · submitted 2025-08-01 · 💻 cs.CR

LeakyCLIP: Extracting Training Data from CLIP

Pith reviewed 2026-05-22 12:27 UTC · model grok-4.3

classification 💻 cs.CR
keywords CLIP inversiontraining data extractionprivacy leakagemultimodal modelsadversarial fine-tuningmembership inferenceimage reconstruction
0
0 comments X

The pith

LeakyCLIP recovers training images from CLIP embeddings by inverting the model with adversarial fine-tuning, alignment, and diffusion refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how CLIP models can leak their training data through inversion attacks that reconstruct images from text embeddings. It identifies three obstacles to high-quality reconstruction: non-robust features, limited semantics in text embeddings, and low output fidelity. LeakyCLIP counters these with adversarial fine-tuning for smoother optimization, a linear transformation to align embeddings, and Stable Diffusion to sharpen the final images. On a LAION-2B subset the method improves SSIM by more than 258 percent for ViT-B-16 over baselines, and even low-fidelity outputs still reveal which images were used in training.

Core claim

LeakyCLIP is a CLIP inversion framework that reconstructs semantically accurate training images from embeddings by combining adversarial fine-tuning to smooth optimization, linear transformation-based embedding alignment, and Stable Diffusion refinement to boost fidelity. On a LAION-2B subset it raises SSIM by over 258 percent for ViT-B-16 relative to prior methods, and membership inference succeeds from the metrics of even low-fidelity reconstructions.

What carries the argument

LeakyCLIP inversion pipeline that mitigates non-robust features via adversarial fine-tuning, aligns text and image embeddings with a learned linear transform, and refines outputs with Stable Diffusion.

If this is right

  • CLIP-based models carry a concrete risk of exposing private training images through embedding inversion.
  • Even imperfect reconstructions can be used to detect whether a given image was in the training set.
  • Multimodal models that rely on CLIP-style contrastive training inherit similar leakage surfaces.
  • Defenses must address both high-fidelity reconstruction and low-fidelity membership signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar inversion techniques could be adapted to other vision-language models that produce joint embeddings.
  • Organizations releasing CLIP checkpoints may need to audit training sets or add differential privacy at the embedding stage.
  • The success of low-fidelity membership inference suggests that simple statistical checks on reconstruction quality could serve as a cheap leakage detector.

Load-bearing premise

The three listed challenges in CLIP inversion can be fixed by the proposed fine-tuning, alignment, and diffusion steps without creating new problems that erase the reported gains.

What would settle it

A controlled test on the same ViT-B-16 model and LAION-2B subset in which the SSIM of LeakyCLIP reconstructions falls below 150 percent of the strongest baseline.

Figures

Figures reproduced from arXiv: 2508.00756 by Ran He, Shujie Wang, Xingjun Ma, Xin Wang, Yu-Gang Jiang, Yunhao Chen.

Figure 1
Figure 1. Figure 1: The first row presents reconstructed images generated by LeakyCLIP, while the second [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: LeakyCLIP for Training Data Extraction. (i) Adversarial Fine-Tuning (AFT): The image [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Top: Reconstructed images by LeakyCLIP. Bottom: The original images and metric values. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Gradient Distribution Comparison: Histograms showing that adversarially fine-tuned CLIP [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The relationship between the reconstruction similarity of image embedding in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Highly similar faces extracted face data [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of L1 and Frobenius Norms Across CLIP Models. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Empirical validation of Theorem A.2 and Corollary A.3. The plot depicts the relationship [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The relationship between the number of tokens and the logarithm of inversion loss. This [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Reconstruction quality results for ViT-L-14 fine-tuned with different values of eps. The [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Visual Comparison of LeakyCLIP Methods. From left to right: Baseline results, AFT+EA [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Visual Comparison of LeakyCLIP Methods. From left to right: Baseline results, AFT+EA [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
read the original abstract

Understanding the memorization and privacy leakage risks in Contrastive Language--Image Pretraining (CLIP) is critical for ensuring the security of multimodal models. Recent studies have demonstrated the feasibility of extracting sensitive training examples from diffusion models, with conditional diffusion models exhibiting a stronger tendency to memorize and leak information. In this work, we investigate data memorization and extraction risks in CLIP through the lens of CLIP inversion, a process that aims to reconstruct training images from text prompts. To this end, we introduce \textbf{LeakyCLIP}, a novel attack framework designed to achieve high-quality, semantically accurate image reconstruction from CLIP embeddings. We identify three key challenges in CLIP inversion: 1) non-robust features, 2) limited visual semantics in text embeddings, and 3) low reconstruction fidelity. To address these challenges, LeakyCLIP employs 1) adversarial fine-tuning to enhance optimization smoothness, 2) linear transformation-based embedding alignment, and 3) Stable Diffusion-based refinement to improve fidelity. Empirical results demonstrate the superiority of LeakyCLIP, achieving over 258% improvement in Structural Similarity Index Measure (SSIM) for ViT-B-16 compared to baseline methods on LAION-2B subset. Furthermore, we uncover a pervasive leakage risk, showing that training data membership can even be successfully inferred from the metrics of low-fidelity reconstructions. Our work introduces a practical method for CLIP inversion while offering novel insights into the nature and scope of privacy risks in multimodal models. The code is in the https://github.com/dongdongunique/LeakyCLIP

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LeakyCLIP, a framework for inverting CLIP text embeddings to reconstruct training images. It identifies three challenges (non-robust features, limited visual semantics in text embeddings, low reconstruction fidelity) and mitigates them via adversarial fine-tuning, linear transformation alignment, and Stable Diffusion refinement. On a LAION-2B subset it reports >258% SSIM improvement for ViT-B-16 over baselines plus successful membership inference from low-fidelity reconstruction metrics; code is released.

Significance. If the gains can be shown to arise from CLIP-specific information rather than external generative priors, the work would provide a concrete, reproducible demonstration of memorization leakage in CLIP and a practical inversion attack. The open-source code and quantitative SSIM/membership results are positive for verifiability.

major comments (2)
  1. [§3.3 and §4] Refinement step (described in §3.3 and evaluated in §4): no ablation is reported that removes Stable Diffusion refinement or replaces it with a non-generative upsampler. Because Stable Diffusion was trained on data overlapping LAION-2B, the 258% SSIM gain and membership-inference signal could be driven by the generative prior rather than CLIP embeddings; this directly affects the central claim that the attack extracts CLIP training data.
  2. [§5] Membership-inference experiment (reported in abstract and §5): the paper states that membership can be inferred from low-fidelity metrics but provides neither the precise decision rule, the feature set, nor statistical significance tests or baseline false-positive rates. This is load-bearing for the privacy-risk conclusion.
minor comments (2)
  1. [Abstract] Abstract: the GitHub URL is given but the camera-ready version should confirm that the repository contains the exact scripts and hyperparameters used for the reported ViT-B-16 / LAION-2B numbers.
  2. [§3.2] Notation: ensure that “CLIP inversion” and “embedding alignment” are defined once and used consistently; the linear transformation is introduced without an equation number in the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps clarify the contributions and limitations of our work on CLIP inversion and privacy leakage. We address each major comment below and will revise the manuscript accordingly to strengthen the evidence that our results reflect CLIP-specific memorization rather than external priors.

read point-by-point responses
  1. Referee: [§3.3 and §4] Refinement step (described in §3.3 and evaluated in §4): no ablation is reported that removes Stable Diffusion refinement or replaces it with a non-generative upsampler. Because Stable Diffusion was trained on data overlapping LAION-2B, the 258% SSIM gain and membership-inference signal could be driven by the generative prior rather than CLIP embeddings; this directly affects the central claim that the attack extracts CLIP training data.

    Authors: We agree that an ablation isolating the Stable Diffusion refinement is necessary to substantiate that the reported gains derive from CLIP embedding information rather than the generative prior. In the revised manuscript we will add this ablation by (i) removing the refinement entirely and (ii) replacing it with a non-generative upsampler (bicubic interpolation followed by a lightweight super-resolution network trained on a disjoint dataset). We will report the resulting SSIM, perceptual metrics, and membership-inference performance for both variants on the same LAION-2B subset. Preliminary internal checks indicate that the core inversion (adversarial fine-tuning + linear alignment) already yields measurable improvement over baselines even without refinement; the new experiments will quantify how much additional gain is attributable to Stable Diffusion. We will also discuss the known training-data overlap and its implications for interpreting the results. revision: yes

  2. Referee: [§5] Membership-inference experiment (reported in abstract and §5): the paper states that membership can be inferred from low-fidelity metrics but provides neither the precise decision rule, the feature set, nor statistical significance tests or baseline false-positive rates. This is load-bearing for the privacy-risk conclusion.

    Authors: We acknowledge that the membership-inference protocol requires additional specification for reproducibility and to support the privacy conclusions. In the revised version we will explicitly state: (1) the decision rule (thresholding on a linear combination of SSIM, LPIPS, and cosine similarity between reconstructed and original embeddings), (2) the complete feature set, and (3) statistical evaluation including AUC-ROC, p-values from permutation tests, and false-positive rates measured on a balanced set of non-member images drawn from the same LAION-2B distribution. We will also include baseline comparisons against random guessing and against a simple embedding-norm threshold. These additions will be placed in §5 with corresponding tables and figures. revision: yes

Circularity Check

0 steps flagged

No circularity: LeakyCLIP results rest on empirical comparisons to external baselines

full rationale

The paper proposes an empirical attack pipeline (adversarial fine-tuning + linear alignment + Stable Diffusion refinement) to invert CLIP embeddings and reports SSIM gains plus membership inference on LAION-2B. These outcomes are measured against independent baseline methods rather than being defined or forced by the method itself; no equation or claim reduces to a fitted parameter renamed as a prediction, and no load-bearing step collapses to a self-citation or internal redefinition of the target quantities.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The work relies on standard machine-learning optimization and pre-trained diffusion models without introducing new mathematical axioms or postulated entities; hyperparameters for fine-tuning and alignment are implicit but not enumerated as load-bearing free parameters in the abstract.

free parameters (1)
  • adversarial fine-tuning hyperparameters
    Parameters controlling optimization smoothness are chosen to address non-robust features but their specific values are not detailed in the abstract.

pith-pipeline@v0.9.0 · 5835 in / 1200 out tokens · 48952 ms · 2026-05-22T12:27:32.519679+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 5 internal anchors

  1. [1]

    BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

    Jiawang Bai et al. “BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024, pp. 24239–24250

  2. [2]

    Extracting training data from large language models

    Nicholas Carlini et al. “Extracting training data from large language models”. In:30th USENIX Security Symposium (USENIX Security 21). 2021, pp. 2633–2650

  3. [3]

    Extracting training data from diffusion models

    Nicolas Carlini et al. “Extracting training data from diffusion models”. In:USENIX Security. 2023

  4. [4]

    Knowledge-enriched distributional model inversion attacks

    Sheng Chen et al. “Knowledge-enriched distributional model inversion attacks”. In: CVPR. 2021

  5. [5]

    A comprehensive survey for generative data augmentation

    Yunhao Chen, Zihui Yan, and Yunjie Zhu. “A comprehensive survey for generative data augmentation”. In: Neurocomputing (2024), p. 128167

  6. [6]

    Extracting training data from unconditional diffusion models

    Yunhao Chen et al. “Extracting training data from unconditional diffusion models”. In:arXiv preprint arXiv:2410.02467 (2024)

  7. [7]

    ImageNet: A large-scale hierarchical image database

    Jia Deng et al. “ImageNet: A large-scale hierarchical image database”. In: CVPR. 2009, pp. 248–255

  8. [8]

    Model inversion attacks that exploit confidence information and basic countermeasures

    Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. “Model inversion attacks that exploit confidence information and basic countermeasures”. In: CCS 2015. 2015. 10

  9. [9]

    Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing

    Matt Fredrikson et al. “Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing”. In: USENIX Security 2014. 2014

  10. [10]

    Clipag: Towards generator-free text-to-image generation

    Roy Ganz and Michael Elad. “Clipag: Towards generator-free text-to-image generation”. In: WCCV. 2024

  11. [11]

    Do perceptually aligned gradients imply robust- ness?

    Roy Ganz, Bahjat Kawar, and Michael Elad. “Do perceptually aligned gradients imply robust- ness?” In: ICML. PMLR. 2023, pp. 10628–10648

  12. [12]

    On memorization in diffusion models

    Xiangming Gu et al. “On memorization in diffusion models”. In: arXiv preprint arXiv:2310.02664 (2023)

  13. [13]

    Does clip know my face?

    Dominik Hintersdorf et al. “Does clip know my face?” In:Journal of Artificial Intelligence Research 80 (2024), pp. 1033–1062

  14. [14]

    Labeled faces in the wild: A database forstudying face recognition in unconstrained environments

    Gary B Huang et al. “Labeled faces in the wild: A database forstudying face recognition in unconstrained environments”. In:Workshop on faces in’Real-Life’Images: detection, alignment, and recognition. 2008

  15. [15]

    Ilharco, M

    Gabriel Ilharco et al. OpenCLIP. Version 0.1. July 2021. DOI: 10.5281/zenodo.5143773. URL: https://doi.org/10.5281/zenodo.5143773

  16. [16]

    Adversarial examples are not bugs, they are features

    Andrew Ilyas et al. “Adversarial examples are not bugs, they are features”. In: NIPS (2019)

  17. [17]

    D \’ej\a Vu Memorization in Vision-Language Models

    Bargav Jayaraman, Chuan Guo, and Kamalika Chaudhuri. “D \’ej\a Vu Memorization in Vision-Language Models”. In: arXiv preprint arXiv:2402.02103 (2024)

  18. [18]

    Guiding the long-short term memory model for image caption generation

    Xu Jia et al. “Guiding the long-short term memory model for image caption generation”. In: ICCV. 2015, pp. 2407–2415

  19. [19]

    Challenges and applications of large language models

    Jean Kaddour et al. “Challenges and applications of large language models”. In:arXiv preprint arXiv:2307.10169 (2023)

  20. [20]

    Label-only model inversion attacks via boundary repulsion

    Mostafa Kahla et al. “Label-only model inversion attacks via boundary repulsion”. In:CVPR. 2022, pp. 15045–15053

  21. [21]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative adversarial networks”. In: CVPR. 2019

  22. [22]

    What do we learn from inverting CLIP models?

    Hamid Kazemi et al. “What do we learn from inverting CLIP models?” In: arXiv preprint arXiv:2403.02580 (2024)

  23. [23]

    LAION-HD Subset Dataset

    Yuval Kirstain. LAION-HD Subset Dataset . https : / / huggingface . co / datasets / yuvalkirstain/laion-hd-subset. 2024

  24. [24]

    A Comprehensive Analysis of Memorization in Large Language Models

    Hirokazu Kiyomaru et al. “A Comprehensive Analysis of Memorization in Large Language Models”. In: ACL. 2024

  25. [25]

    Deep learning face attributes in the wild

    Ziwei Liu et al. “Deep learning face attributes in the wild”. In: ICCV. 2015, pp. 3730–3738

  26. [26]

    sample_furniture_object

    Abrar Lohia. sample_furniture_object. Hugging Face, 2024. URL: https://huggingface. co/datasets/abrarlohia/sample_furniture_object

  27. [27]

    Image segmentation using text and image prompts

    Timo Lüddecke and Alexander Ecker. “Image segmentation using text and image prompts”. In: CVPR. 2022, pp. 7086–7096

  28. [28]

    Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

    Xingjun Ma et al. “Safety at scale: A comprehensive survey of large model safety”. In:arXiv preprint arXiv:2502.05206 (2025)

  29. [29]

    Towards Deep Learning Models Resistant to Adversarial Attacks

    Aleksander Madry. “Towards deep learning models resistant to adversarial attacks”. In:arXiv preprint arXiv:1706.06083 (2017)

  30. [30]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Chenlin Meng et al. “Sdedit: Guided image synthesis and editing with stochastic differential equations”. In: arXiv preprint arXiv:2108.01073 (2021)

  31. [31]

    The application of large language models in medicine: A scoping review

    Xiangbin Meng et al. “The application of large language models in medicine: A scoping review”. In: Iscience 27.5 (2024)

  32. [32]

    Re-thinking model inversion attacks against deep neural networks

    Ngoc-Bao Nguyen et al. “Re-thinking model inversion attacks against deep neural networks”. In: CVPR. 2023

  33. [33]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol et al. “Glide: Towards photorealistic image generation and editing with text-guided diffusion models”. In: arXiv preprint arXiv:2112.10741 (2021)

  34. [34]

    Ldp-feat: Image features with local differential privacy

    Francesco Pittaluga and Bingbing Zhuang. “Ldp-feat: Image features with local differential privacy”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision . 2023, pp. 17580–17590

  35. [35]

    A Self-Supervised Descriptor for Image Copy Detection

    Ed Pizzi et al. “A Self-Supervised Descriptor for Image Copy Detection”. In: CVPR (2022)

  36. [36]

    Diffusers: State-of-the-art diffusion models

    Patrick von Platen et al. Diffusers: State-of-the-art diffusion models. https://github.com/ huggingface/diffusers. 2022. 11

  37. [37]

    Learning transferable visual models from natural language supervision

    Alec Radford et al. “Learning transferable visual models from natural language supervision”. In: ICML. 2021

  38. [38]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach et al. “High-resolution image synthesis with latent diffusion models”. In: CVPR. 2022, pp. 10684–10695

  39. [39]

    Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

    Christian Schlarmann et al. “Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models”. In:ICML (2024)

  40. [40]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann et al. “Laion-5b: An open large-scale dataset for training next generation image-text models”. In: NIPS (2022)

  41. [41]

    Diffusion art or digital forgery? investigating data replication in diffusion models

    Gowthami Somepalli et al. “Diffusion art or digital forgery? investigating data replication in diffusion models”. In: CVPR. 2023

  42. [42]

    Understanding and mitigating copying in diffusion models

    Gowthami Somepalli et al. “Understanding and mitigating copying in diffusion models”. In: NIPS (2023)

  43. [43]

    Plug & play attacks: Towards robust and flexible model inversion attacks

    Lukas Struppek et al. “Plug & play attacks: Towards robust and flexible model inversion attacks”. In: ICML. 2022

  44. [44]

    Contrastive Learning is Spectral Clustering on Similarity Graph

    Zhiquan Tan et al. “Contrastive Learning is Spectral Clustering on Similarity Graph”. In:ICLR. 2024

  45. [45]

    Captured by captions: On memorization and its mitigation in CLIP models

    Wenhao Wang et al. “Captured by captions: On memorization and its mitigation in CLIP models”. In: arXiv preprint arXiv:2502.07830 (2025)

  46. [46]

    Memorization in self-supervised learning improves downstream general- ization

    Wenhao Wang et al. “Memorization in self-supervised learning improves downstream general- ization”. In: arXiv preprint arXiv:2401.12233 (2024)

  47. [47]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang et al. “Image quality assessment: from error visibility to structural similarity”. In: IEEE transactions on image processing 13.4 (2004), pp. 600–612

  48. [48]

    Multi-modal Identity Extraction

    Ryan Webster and Teddy Furon. “Multi-modal Identity Extraction”. In: ICCV 2025- International Conference on Computer Vision. 2025

  49. [49]

    Memorization in deep learning: A survey

    Jiaheng Wei et al. “Memorization in deep learning: A survey”. In: arXiv preprint arXiv:2406.03880 (2024)

  50. [50]

    Demystifying CLIP Data

    Hu Xu et al. “Demystifying clip data”. In: arXiv preprint arXiv:2309.16671 (2023)

  51. [51]

    Pseudo label-guided model inversion attack via conditional generative adversarial network

    Xiaojian Yuan et al. “Pseudo label-guided model inversion attack via conditional generative adversarial network”. In: AAAI. 2023

  52. [52]

    SecretGen: Privacy Recovery on Pre-trained Models via Distribution Discrimination

    Zhuowen Yuan et al. “SecretGen: Privacy Recovery on Pre-trained Models via Distribution Discrimination”. In: ECCV. 2022

  53. [53]

    Long-clip: Unlocking the long-text capability of clip

    Beichen Zhang et al. “Long-clip: Unlocking the long-text capability of clip”. In: ECCV. 2025

  54. [54]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang et al. “The unreasonable effectiveness of deep features as a perceptual metric”. In: CVPR. 2018

  55. [55]

    The secret revealer: Generative model-inversion attacks against deep neural networks

    Yuheng Zhang et al. “The secret revealer: Generative model-inversion attacks against deep neural networks”. In: CVPR. 2020

  56. [56]

    TwinDiffusion: Enhancing Coherence and Efficiency in Panoramic Image Generation with Diffusion Models

    Teng Zhou and Yongchuan Tang. “TwinDiffusion: Enhancing Coherence and Efficiency in Panoramic Image Generation with Diffusion Models”. In:arXiv preprint arXiv:2404.19475 (2024). 12 A Theoretical Analysis and Understanding In this section, we analyze the theoretical feasibility of recovering training images from text embed- dings, assuming the image encode...

  57. [57]

    discoverability phenomenon

    (12) Thus, this condition can be interpreted as constraining the variance of qk(z) in the latent space, which can be achieved with a careful selection of the informative label. Theorem A.2 is consistent with prior works [42, 6]. They show that unconditional models don’t replicate data, while text-conditioning increases memorization. And previous work [12]...