Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift

Gihoon Kim; Hyungjin Park; Taesup Kim

arxiv: 2505.19519 · v3 · submitted 2025-05-26 · 💻 cs.CV

Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift

Gihoon Kim , Hyungjin Park , Taesup Kim This is my paper

Pith reviewed 2026-05-19 13:59 UTC · model grok-4.3

classification 💻 cs.CV

keywords personalized diffusion modelstext-to-image generationLipschitz regularizationdistributional driftfine-tuningprompt adherenceoverfitting preventiongenerative model adaptation

0 comments

The pith

A Lipschitz constraint on parameter updates during personalization keeps text-to-image diffusion models from drifting away from their original output distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard personalization of diffusion models overfits to a handful of reference images and stops obeying new text prompts because it lets the output distribution wander. To fix this, the authors add a regularization term that limits how far any parameter can move, measured through its Lipschitz constant. This bound keeps generated images diverse and prompt-responsive while still letting the model learn the new concept. The method is cheaper than repeated sampling checks and works on multiple diffusion backbones. If the claim holds, users could adapt models to personal objects or styles without losing the broad capabilities the original training gave them.

Core claim

The central claim is that a Lipschitz-based regularization objective, applied during the personalization fine-tuning stage, constrains parameter updates so that the model's output distribution stays close to the pretrained distribution. This simultaneously achieves high fidelity to the reference images and strong adherence to novel text prompts, outperforming prior methods that rely only on reconstruction or prompt-alignment losses.

What carries the argument

Lipschitz-based regularization objective that penalizes large changes in the model's parameters during fine-tuning on reference images.

If this is right

Personalized models retain the ability to generate varied images for the same prompt instead of copying the reference set.
Text alignment improves because the model cannot collapse to simply reproducing the training images.
The approach replaces expensive sampling-based distribution checks with a direct parameter-space penalty that is cheaper to compute.
The same regularization can be added to different diffusion architectures without changing their training pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bounded-update idea might transfer to other generative tasks such as personalized video or 3-D synthesis where distribution drift is also a problem.
If the Lipschitz bound can be made adaptive per layer, it could further reduce the trade-off between concept learning and preservation.
The method suggests a general principle for fine-tuning large models: keep the change in function space small even when the parameter change appears large.

Load-bearing premise

Limiting the size of parameter updates through a Lipschitz bound is enough to stop the output distribution from drifting while still allowing the model to learn a new visual concept from a few images.

What would settle it

A test set of prompts and reference images where the regularized model still produces outputs whose distribution (measured by Fréchet distance or similar) deviates as much from the original model as an unregularized personalization run.

Figures

Figures reproduced from arXiv: 2505.19519 by Gihoon Kim, Hyungjin Park, Taesup Kim.

**Figure 2.** Figure 2: Effect of Lipschitz regularization: Increasing λ reduces both ∆θ and ∆ϵ. result suggests that, unlike conventional methods that primarily preserve the subject class (e.g., “dog” in “a my dog in the snow”), our approach can retain broader contextual generality (e.g., “the snow”) during personalization. This also demonstrates that the proposed method can maintain the pretrained distribution even without exp… view at source ↗

**Figure 3.** Figure 3: Visual comparison with baseline methods text-image alignment based on cosine similarity between the prompt and the generated image. The special token is excluded when computing CLIP-T scores. 5.2 Comparisons with Baseline Methods Qualitative Comparison. The qualitative results are presented in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Trade-off curves between image fidelity and text alignment scores across different [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Visual comparison under different regularization strengths. Both [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Additional visual Comparison with other baselines. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Additional visual Comparison with other baselines 2. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Visual comparison of failure cases across baselines. [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Visual comparison of failure cases across baselines 2. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Additional visual Comparison with other baselines. [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: ImageNet class generation: In the case of the "tiger shark" class, DreamBooth generates [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: ImageNet class generation: In the case of the "gold fish" class, we observe that increasing [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 13.** Figure 13: Visualization results under different regularization strengths [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

**Figure 14.** Figure 14: Visualization of ablated results under different [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

read the original abstract

Personalizing text-to-image diffusion models involves integrating novel visual concepts from a small set of reference images while retaining the model's original generative capabilities. However, this process often leads to overfitting, where the model ignores the user's prompt and merely replicates the reference images. We attribute this issue to a fundamental misalignment between the true goals of personalization, which are subject fidelity and text alignment, and the training objectives of existing methods that fail to enforce both objectives simultaneously. Specifically, prior approaches often overlook the need to explicitly preserve the pretrained model's output distribution, resulting in distributional drift that undermines diversity and coherence. To resolve these challenges, we introduce a Lipschitz-based regularization objective that constrains parameter updates during personalization, ensuring bounded deviation from the original distribution. This promotes consistency with the pretrained model's behavior while enabling accurate adaptation to new concepts. Furthermore, our method offers a computationally efficient alternative to commonly used, resource-intensive sampling techniques. Through extensive experiments across diverse diffusion model architectures, we demonstrate that our approach achieves superior performance in both quantitative metrics and qualitative evaluations, consistently excelling in visual fidelity and prompt adherence. We further support these findings with comprehensive analyses, including ablation studies and visualizations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Lipschitz regularization on parameters during personalization is a reasonable practical fix but the claimed link to bounded distributional drift after full diffusion sampling still needs tighter justification.

read the letter

The paper's core move is to add a Lipschitz-based regularizer on parameter updates while fine-tuning a diffusion model on a few reference images. The goal is to keep the output distribution from drifting too far from the pretrained one, which they argue causes the usual overfitting where the model just copies the references instead of following new prompts. They position this as more efficient than heavy sampling-based preservation tricks and test it across several architectures with ablations and both quantitative and qualitative results showing better fidelity and text alignment.

Referee Report

2 major / 2 minor

Summary. The paper claims that a Lipschitz-based regularization objective constrains parameter updates during fine-tuning of text-to-image diffusion models for personalization, ensuring bounded deviation from the pretrained output distribution. This prevents overfitting and distributional drift while enabling adaptation to novel concepts from few reference images, yielding superior visual fidelity and prompt adherence across architectures, as shown via quantitative metrics, qualitative results, ablations, and visualizations. It positions the method as a computationally efficient alternative to sampling-based techniques.

Significance. If the regularization truly produces bounded distributional drift without relying on post-hoc tuning or unstated assumptions, the work would offer a practical and theoretically motivated solution to a core limitation in personalized diffusion models. It could reduce reliance on expensive inference-time methods and improve reliability of subject-driven generation in applications.

major comments (2)

[Abstract and §3] Abstract and §3 (regularization objective): the claim that the Lipschitz constraint 'ensures bounded deviation from the original distribution' lacks any derivation connecting the parameter-level regularization to a bound on the push-forward measure after the full multi-step denoising process. Diffusion sampling composes dozens of network calls; a local Lipschitz bound on weights does not automatically yield a uniform bound on KL, Wasserstein, or total-variation distance to the pretrained distribution unless Lipschitz continuity of the score network along the trajectory or control on the noise schedule is established.
[§4] §4 (experiments and ablations): no empirical measurement of distributional drift (e.g., on held-out prompts or via FID/KL proxies between pretrained and personalized models) is reported. The superiority claims rest on visual fidelity and prompt adherence metrics, but without direct quantification of the claimed preservation mechanism the central causal link remains unverified.

minor comments (2)

[§3] Notation for the regularization strength lambda should be introduced with an explicit sensitivity study or selection protocol to avoid the appearance that performance gains depend on implicit fitting.
[Tables 1-3] Table captions and axis labels in quantitative results would benefit from clearer units and baseline descriptions for immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which helps clarify important aspects of our theoretical claims and empirical validation. We respond to each major comment below and indicate the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (regularization objective): the claim that the Lipschitz constraint 'ensures bounded deviation from the original distribution' lacks any derivation connecting the parameter-level regularization to a bound on the push-forward measure after the full multi-step denoising process. Diffusion sampling composes dozens of network calls; a local Lipschitz bound on weights does not automatically yield a uniform bound on KL, Wasserstein, or total-variation distance to the pretrained distribution unless Lipschitz continuity of the score network along the trajectory or control on the noise schedule is established.

Authors: We appreciate the referee's observation regarding the theoretical gap. Our §3 derives a bound on the change in the network's output for a single forward pass under the Lipschitz regularization on parameters. However, extending this to a bound on the distributional distance after the complete multi-step sampling process indeed requires additional analysis, such as propagating the Lipschitz constant through the denoising trajectory and considering the specific noise schedule. We did not provide a complete end-to-end derivation in the original manuscript. To address this, we will revise §3 to include a more explicit discussion of the assumptions and a sketch of how the per-step bound implies controlled deviation in the generated distribution, or clarify the scope of our theoretical claims. revision: yes
Referee: [§4] §4 (experiments and ablations): no empirical measurement of distributional drift (e.g., on held-out prompts or via FID/KL proxies between pretrained and personalized models) is reported. The superiority claims rest on visual fidelity and prompt adherence metrics, but without direct quantification of the claimed preservation mechanism the central causal link remains unverified.

Authors: The referee correctly identifies that we have not directly quantified distributional drift in our experiments. Our evaluations emphasize the practical outcomes of subject fidelity and prompt adherence, which indirectly reflect preservation of the original capabilities. To provide direct evidence for the preservation mechanism, we will add new experiments in the revised §4, including measurements such as FID between samples from the pretrained and fine-tuned models on a set of held-out prompts, as well as other proxies for distributional similarity. This will help substantiate the causal role of the Lipschitz regularization in mitigating drift. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation introduces independent regularization

full rationale

The paper introduces a Lipschitz-based regularization objective on parameter updates as a novel mechanism to bound deviation from the pretrained distribution during personalization. No equations or sections in the provided abstract and description reduce this claim to a self-definition, a fitted hyperparameter renamed as a prediction, or a load-bearing self-citation chain. The central premise rests on the proposed constraint itself rather than re-deriving an input quantity, and the method is positioned as an efficient alternative to sampling techniques with independent empirical validation across architectures. This qualifies as a self-contained contribution without the specific reductions required for a positive circularity finding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that a Lipschitz constraint on parameter changes directly controls output distribution shift in diffusion models, plus at least one tunable regularization weight that must be chosen to balance fidelity and preservation.

free parameters (1)

regularization strength lambda
Controls the tightness of the Lipschitz bound; must be selected or tuned to achieve the reported trade-off between subject fidelity and prompt adherence.

axioms (1)

domain assumption Lipschitz continuity of the parameter update map implies bounded deviation in the generated output distribution
Invoked to justify why the regularization prevents distributional drift without further proof or empirical calibration shown in the abstract.

pith-pipeline@v0.9.0 · 5736 in / 1160 out tokens · 45749 ms · 2026-05-19T13:59:07.647998+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

If the diffusion model ε_θ is Lipschitz continuous in θ, then … D_KL(p_θ1 ∥ p_θ2) ≤ λ · ||θ1 − θ2||_k (Theorem 2)
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lipschitz regularization on ∥θ_per − θ_base∥_k provides a bound on the distributional shift

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 8 internal anchors

[1]

Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems, 34:8780–8794, 2021

Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems, 34:8780–8794, 2021

work page 2021
[2]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[3]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[4]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021

work page 2021
[5]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011
[7]

Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022

work page 2022
[8]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[9]

Photorealistic text-to- image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to- image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

work page 2022
[10]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

work page 2023
[13]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Prompting hard or hardly prompting: Prompt inversion for text-to-image diffusion models

Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, and Leonid Sigal. Prompting hard or hardly prompting: Prompt inversion for text-to-image diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6808–6817, 2024

work page 2024
[15]

Generating images of rare con- cepts using pre-trained diffusion models

Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. Generating images of rare con- cepts using pre-trained diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4695–4703, 2024

work page 2024
[16]

Norm-guided latent space exploration for text-to-image generation.Advances in Neural Information Processing Systems, 36:57863– 57875, 2023

Dvir Samuel, Rami Ben-Ari, Nir Darshan, Haggai Maron, and Gal Chechik. Norm-guided latent space exploration for text-to-image generation.Advances in Neural Information Processing Systems, 36:57863– 57875, 2023

work page 2023
[17]

Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models

Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22522–22531, 2023

work page 2023
[18]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

work page 2022
[19]

Multi-concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023. 10

work page 1931
[20]

Key-locked rank one editing for text-to-image personalization

Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personalization. InACM SIGGRAPH 2023 conference proceedings, pages 1–11, 2023

work page 2023
[21]

Controlling text-to-image diffusion by orthogonal finetuning.Advances in Neural Information Processing Systems, 36:79320–79362, 2023

Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning.Advances in Neural Information Processing Systems, 36:79320–79362, 2023

work page 2023
[22]

Svdiff: Compact parameter space for diffusion fine-tuning

Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, and Feng Yang. Svdiff: Compact parameter space for diffusion fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7323–7334, 2023

work page 2023
[23]

arXiv preprint arXiv:2303.09522 (2023)

Andrey V oynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. p+: Extended textual conditioning in text-to-image generation.arXiv preprint arXiv:2303.09522, 2023

work page arXiv 2023
[24]

Dreamblend: Advancing personalized fine-tuning of text-to-image diffusion models

Shwetha Ram, Tal Neiman, Qianli Feng, Andrew Stuart, Son Tran, and Trishul Chilimbi. Dreamblend: Advancing personalized fine-tuning of text-to-image diffusion models. 2025

work page 2025
[25]

Para: Personalizing text-to-image diffusion via parameter rank reduction.arXiv preprint arXiv:2406.05641, 2024

Shangyu Chen, Zizheng Pan, Jianfei Cai, and Dinh Phung. Para: Personalizing text-to-image diffusion via parameter rank reduction.arXiv preprint arXiv:2406.05641, 2024

work page arXiv 2024
[26]

Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

work page 2022
[27]

Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.Advances in Neural Information Processing Systems, 36:30146– 30166, 2023

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.Advances in Neural Information Processing Systems, 36:30146– 30166, 2023

work page 2023
[28]

Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation

Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023

work page 2023
[29]

Subject-driven text-to-image generation via apprenticeship learning.Advances in Neural Information Processing Systems, 36:30286–30305, 2023

Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W Cohen. Subject-driven text-to-image generation via apprenticeship learning.Advances in Neural Information Processing Systems, 36:30286–30305, 2023

work page 2023
[30]

Kosmos-g: Generating images in context with multimodal large language models

Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-g: Generating images in context with multimodal large language models. InThe Twelfth International Conference on Learning Representations

work page
[31]

Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning

Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. InACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024

work page 2024
[32]

Dream- cache: Finetuning-free lightweight personalized image generation via feature caching.arXiv preprint arXiv:2411.17786, 2024

Emanuele Aiello, Umberto Michieli, Diego Valsesia, Mete Ozay, and Enrico Magli. Dream- cache: Finetuning-free lightweight personalized image generation via feature caching.arXiv preprint arXiv:2411.17786, 2024

work page arXiv 2024
[33]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

work page 2023
[34]

T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 4296–4304, 2024

work page 2024
[35]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

work page 2022
[36]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

work page 2023
[37]

DreamFusion: Text-to-3D using 2D Diffusion

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[38]

Delta denoising score

Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta denoising score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2328–2337, 2023. 11

work page 2023
[39]

Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation.Advances in Neural Information Processing Systems, 36:8406–8441, 2023

Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation.Advances in Neural Information Processing Systems, 36:8406–8441, 2023

work page 2023
[40]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

work page 2021
[41]

Dreamsteerer: Enhancing source image conditioned editability using personalized diffusion models.Advances in Neural Information Processing Systems, 37:120699–120734, 2024

Zhengyang Yu, Zhaoyuan Yang, and Jing Zhang. Dreamsteerer: Enhancing source image conditioned editability using personalized diffusion models.Advances in Neural Information Processing Systems, 37:120699–120734, 2024

work page 2024
[42]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

work page 2024
[43]

Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

work page 2024
[44]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, pages 87–103. Springer, 2024

work page 2024
[45]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015

work page 2015
[46]

Tutorial on diffusion models for imaging and vision.Foundations and Trends® in Computer Graphics and Vision, 16(4):322–471, 2024

Stanley Chan et al. Tutorial on diffusion models for imaging and vision.Foundations and Trends® in Computer Graphics and Vision, 16(4):322–471, 2024

work page 2024
[47]

Understanding diffusion models: A unified perspective,

Calvin Luo. Understanding diffusion models: A unified perspective.arXiv preprint arXiv:2208.11970, 2022

work page arXiv 2022
[48]

Step-by-step diffusion: An elementary tutorial.arXiv preprint arXiv:2406.08929, 2024

Preetum Nakkiran, Arwen Bradley, Hattie Zhou, and Madhu Advani. Step-by-step diffusion: An elementary tutorial.arXiv preprint arXiv:2406.08929, 2024

work page arXiv 2024
[49]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs.arXiv preprint arXiv:2111.02114, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[50]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

work page 2022
[51]

Deep learning.nature, 521(7553):436–444, 2015

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.nature, 521(7553):436–444, 2015

work page 2015
[52]

Norm-based capacity control in neural networks

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. InConference on learning theory, pages 1376–1401. PMLR, 2015

work page 2015
[53]

Lipschitz continuity in model-based reinforcement learning

Kavosh Asadi, Dipendra Misra, and Michael Littman. Lipschitz continuity in model-based reinforcement learning. InInternational Conference on Machine Learning, pages 264–273. PMLR, 2018

work page 2018
[54]

Tweedie’s formula and selection bias.Journal of the American Statistical Association, 106(496):1602–1614, 2011

Bradley Efron. Tweedie’s formula and selection bias.Journal of the American Statistical Association, 106(496):1602–1614, 2011

work page 2011
[55]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019

work page 2019
[56]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

work page 2021
[57]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 12

work page 2021
[58]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009
[59]

Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

work page 2016
[60]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017
[61]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

work page 2017
[62]

Faster deep reinforcement learning with slower online network.Advances in Neural Information Processing Systems, 35:19944–19955, 2022

Kavosh Asadi, Rasool Fakoor, Omer Gottesman, Taesup Kim, Michael Littman, and Alexander J Smola. Faster deep reinforcement learning with slower online network.Advances in Neural Information Processing Systems, 35:19944–19955, 2022

work page 2022
[63]

Bregman voronoi diagrams: Properties, algorithms and applications.arXiv preprint arXiv:0709.2196, 2007

Frank Nielsen, Jean-Daniel Boissonnat, and Richard Nock. Bregman voronoi diagrams: Properties, algorithms and applications.arXiv preprint arXiv:0709.2196, 2007

work page arXiv 2007
[64]

tiger shark

TM Cover and JA Thomas. Elements of information theory 2nd ed., 2006. 13 Appendix A Proof of Theorem 1 Theorem 1.Let p∗(x, c) denote the reference distribution, and let the model parameters θbase have distributionp θbase(x, c)satisfying, for anyϵ >0, p∗(x, c)−p θbase(x, c) < ε. The model is adapted by gradient descent on the denoising loss LDenoise from E...

work page 2006

[1] [1]

Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems, 34:8780–8794, 2021

Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems, 34:8780–8794, 2021

work page 2021

[2] [2]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020

[3] [3]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[4] [4]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021

work page 2021

[5] [5]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011

[7] [7]

Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022

work page 2022

[8] [8]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022

[9] [9]

Photorealistic text-to- image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to- image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

work page 2022

[10] [10]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

work page 2023

[13] [13]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[14] [14]

Prompting hard or hardly prompting: Prompt inversion for text-to-image diffusion models

Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, and Leonid Sigal. Prompting hard or hardly prompting: Prompt inversion for text-to-image diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6808–6817, 2024

work page 2024

[15] [15]

Generating images of rare con- cepts using pre-trained diffusion models

Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. Generating images of rare con- cepts using pre-trained diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4695–4703, 2024

work page 2024

[16] [16]

Norm-guided latent space exploration for text-to-image generation.Advances in Neural Information Processing Systems, 36:57863– 57875, 2023

Dvir Samuel, Rami Ben-Ari, Nir Darshan, Haggai Maron, and Gal Chechik. Norm-guided latent space exploration for text-to-image generation.Advances in Neural Information Processing Systems, 36:57863– 57875, 2023

work page 2023

[17] [17]

Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models

Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22522–22531, 2023

work page 2023

[18] [18]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

work page 2022

[19] [19]

Multi-concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023. 10

work page 1931

[20] [20]

Key-locked rank one editing for text-to-image personalization

Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personalization. InACM SIGGRAPH 2023 conference proceedings, pages 1–11, 2023

work page 2023

[21] [21]

Controlling text-to-image diffusion by orthogonal finetuning.Advances in Neural Information Processing Systems, 36:79320–79362, 2023

Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning.Advances in Neural Information Processing Systems, 36:79320–79362, 2023

work page 2023

[22] [22]

Svdiff: Compact parameter space for diffusion fine-tuning

Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, and Feng Yang. Svdiff: Compact parameter space for diffusion fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7323–7334, 2023

work page 2023

[23] [23]

arXiv preprint arXiv:2303.09522 (2023)

Andrey V oynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. p+: Extended textual conditioning in text-to-image generation.arXiv preprint arXiv:2303.09522, 2023

work page arXiv 2023

[24] [24]

Dreamblend: Advancing personalized fine-tuning of text-to-image diffusion models

Shwetha Ram, Tal Neiman, Qianli Feng, Andrew Stuart, Son Tran, and Trishul Chilimbi. Dreamblend: Advancing personalized fine-tuning of text-to-image diffusion models. 2025

work page 2025

[25] [25]

Para: Personalizing text-to-image diffusion via parameter rank reduction.arXiv preprint arXiv:2406.05641, 2024

Shangyu Chen, Zizheng Pan, Jianfei Cai, and Dinh Phung. Para: Personalizing text-to-image diffusion via parameter rank reduction.arXiv preprint arXiv:2406.05641, 2024

work page arXiv 2024

[26] [26]

Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

work page 2022

[27] [27]

Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.Advances in Neural Information Processing Systems, 36:30146– 30166, 2023

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.Advances in Neural Information Processing Systems, 36:30146– 30166, 2023

work page 2023

[28] [28]

Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation

Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023

work page 2023

[29] [29]

Subject-driven text-to-image generation via apprenticeship learning.Advances in Neural Information Processing Systems, 36:30286–30305, 2023

Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W Cohen. Subject-driven text-to-image generation via apprenticeship learning.Advances in Neural Information Processing Systems, 36:30286–30305, 2023

work page 2023

[30] [30]

Kosmos-g: Generating images in context with multimodal large language models

Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-g: Generating images in context with multimodal large language models. InThe Twelfth International Conference on Learning Representations

work page

[31] [31]

Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning

Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. InACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024

work page 2024

[32] [32]

Dream- cache: Finetuning-free lightweight personalized image generation via feature caching.arXiv preprint arXiv:2411.17786, 2024

Emanuele Aiello, Umberto Michieli, Diego Valsesia, Mete Ozay, and Enrico Magli. Dream- cache: Finetuning-free lightweight personalized image generation via feature caching.arXiv preprint arXiv:2411.17786, 2024

work page arXiv 2024

[33] [33]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

work page 2023

[34] [34]

T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 4296–4304, 2024

work page 2024

[35] [35]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

work page 2022

[36] [36]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

work page 2023

[37] [37]

DreamFusion: Text-to-3D using 2D Diffusion

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[38] [38]

Delta denoising score

Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta denoising score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2328–2337, 2023. 11

work page 2023

[39] [39]

Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation.Advances in Neural Information Processing Systems, 36:8406–8441, 2023

Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation.Advances in Neural Information Processing Systems, 36:8406–8441, 2023

work page 2023

[40] [40]

Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

work page 2021

[41] [41]

Dreamsteerer: Enhancing source image conditioned editability using personalized diffusion models.Advances in Neural Information Processing Systems, 37:120699–120734, 2024

Zhengyang Yu, Zhaoyuan Yang, and Jing Zhang. Dreamsteerer: Enhancing source image conditioned editability using personalized diffusion models.Advances in Neural Information Processing Systems, 37:120699–120734, 2024

work page 2024

[42] [42]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

work page 2024

[43] [43]

Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

work page 2024

[44] [44]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, pages 87–103. Springer, 2024

work page 2024

[45] [45]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015

work page 2015

[46] [46]

Tutorial on diffusion models for imaging and vision.Foundations and Trends® in Computer Graphics and Vision, 16(4):322–471, 2024

Stanley Chan et al. Tutorial on diffusion models for imaging and vision.Foundations and Trends® in Computer Graphics and Vision, 16(4):322–471, 2024

work page 2024

[47] [47]

Understanding diffusion models: A unified perspective,

Calvin Luo. Understanding diffusion models: A unified perspective.arXiv preprint arXiv:2208.11970, 2022

work page arXiv 2022

[48] [48]

Step-by-step diffusion: An elementary tutorial.arXiv preprint arXiv:2406.08929, 2024

Preetum Nakkiran, Arwen Bradley, Hattie Zhou, and Madhu Advani. Step-by-step diffusion: An elementary tutorial.arXiv preprint arXiv:2406.08929, 2024

work page arXiv 2024

[49] [49]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs.arXiv preprint arXiv:2111.02114, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[50] [50]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

work page 2022

[51] [51]

Deep learning.nature, 521(7553):436–444, 2015

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.nature, 521(7553):436–444, 2015

work page 2015

[52] [52]

Norm-based capacity control in neural networks

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. InConference on learning theory, pages 1376–1401. PMLR, 2015

work page 2015

[53] [53]

Lipschitz continuity in model-based reinforcement learning

Kavosh Asadi, Dipendra Misra, and Michael Littman. Lipschitz continuity in model-based reinforcement learning. InInternational Conference on Machine Learning, pages 264–273. PMLR, 2018

work page 2018

[54] [54]

Tweedie’s formula and selection bias.Journal of the American Statistical Association, 106(496):1602–1614, 2011

Bradley Efron. Tweedie’s formula and selection bias.Journal of the American Statistical Association, 106(496):1602–1614, 2011

work page 2011

[55] [55]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019

work page 2019

[56] [56]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

work page 2021

[57] [57]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 12

work page 2021

[58] [58]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009

[59] [59]

Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

work page 2016

[60] [60]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017

[61] [61]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

work page 2017

[62] [62]

Faster deep reinforcement learning with slower online network.Advances in Neural Information Processing Systems, 35:19944–19955, 2022

Kavosh Asadi, Rasool Fakoor, Omer Gottesman, Taesup Kim, Michael Littman, and Alexander J Smola. Faster deep reinforcement learning with slower online network.Advances in Neural Information Processing Systems, 35:19944–19955, 2022

work page 2022

[63] [63]

Bregman voronoi diagrams: Properties, algorithms and applications.arXiv preprint arXiv:0709.2196, 2007

Frank Nielsen, Jean-Daniel Boissonnat, and Richard Nock. Bregman voronoi diagrams: Properties, algorithms and applications.arXiv preprint arXiv:0709.2196, 2007

work page arXiv 2007

[64] [64]

tiger shark

TM Cover and JA Thomas. Elements of information theory 2nd ed., 2006. 13 Appendix A Proof of Theorem 1 Theorem 1.Let p∗(x, c) denote the reference distribution, and let the model parameters θbase have distributionp θbase(x, c)satisfying, for anyϵ >0, p∗(x, c)−p θbase(x, c) < ε. The model is adapted by gradient descent on the denoising loss LDenoise from E...

work page 2006