TextBoost: Boosting Text Encoder for Personalized Text-to-Image Generation

Hyunjung Shim; Kunhee Kim; NaHyeon Park

arxiv: 2409.08248 · v2 · pith:G5N4F7VRnew · submitted 2024-09-12 · 💻 cs.CV

Pith reviewed 2026-05-23 20:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords personalized text-to-image generation · text encoder fine-tuning · diffusion models · one-shot personalization · lightweight adapters · causality-preserving adaptation · efficient adaptation

The pith

TextBoost personalizes text-to-image models by fine-tuning only the text encoder with lightweight adapters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TextBoost as a way to adapt text-to-image diffusion models to new subjects using far less computation and storage than standard methods. Instead of updating most of the model, it updates only the text encoder while adding a mechanism that keeps original word meanings intact and small adapters that sharpen text signals right before they shape the image. This setup yields quicker training runs and much smaller stored models. Results match existing methods on how closely images match the reference subject yet improve on how well they follow the text prompt and how much variety they produce. A reader would care because full-model personalization often demands too much memory and time to be practical outside large labs.

Core claim

TextBoost is an efficient one-shot personalization approach for text-to-image diffusion models that selectively fine-tunes only the text encoder. A causality-preserving adaptation mechanism maintains the original semantic integrity of the encoder, while lightweight adapters locally refine text embeddings immediately before they reach the cross-attention layers. This design delivers faster convergence, substantially lower storage needs through fewer trainable parameters, comparable subject fidelity, superior text fidelity, and greater generation diversity relative to prior personalization techniques.

What carries the argument

Causality-preserving adaptation mechanism plus lightweight adapters applied directly to the text encoder, which enables selective fine-tuning while preserving semantics and boosting expressiveness with minimal added cost.
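
The Figure 2 excerpt further down this page mentions Low-Rank Adaptation (LoRA) as the parameter-efficiency device. As a rough illustration of why such an adapter is cheap, here is a minimal sketch of a generic LoRA-style linear layer in plain Python. The class name `LoRALinear`, the toy sizes, and the `alpha / rank` scaling are assumptions chosen for illustration; this is not the authors' implementation, and it does not model the causality-preserving mechanism.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration class, not taken from the paper.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, shape d_out x d_in
        self.rank = rank
        self.scale = alpha / rank
        # Common LoRA initialisation: A is small and random, B is all zeros,
        # so the adapted layer starts out identical to the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def trainable_params(self):
        d_out, d_in = len(self.weight), len(self.weight[0])
        return self.rank * (d_in + d_out)

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


# Toy sizes (assumed): an 8 x 8 layer with a rank-2 adapter.
d_in, d_out, rank = 8, 8, 2
rng = random.Random(1)
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
layer = LoRALinear(W, rank)
x = [rng.uniform(-1, 1) for _ in range(d_in)]

print(layer.trainable_params(), "trainable vs", d_in * d_out, "frozen parameters")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `rank * (d_in + d_out)` values need to be trained and stored per adapted layer.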

If this is right

Personalization training converges faster than methods that update larger portions of the model.
Storage requirements drop sharply because only a small number of parameters are trained and saved.
Subject fidelity remains comparable to heavier personalization baselines.
Text fidelity improves over existing approaches, allowing generated images to match prompts more accurately.
Output diversity increases, producing more varied images for the same subject and prompt.
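
To make the storage point concrete, the following back-of-the-envelope sketch converts the parameter counts quoted elsewhere on this page (865.9 M for DreamBooth in the Figure 3 table fragment, 0.7M for TextBoost in the quoted paper passage) into approximate checkpoint sizes. The 4-bytes-per-parameter figure (float32) is an assumption; the page does not state the storage dtype, and real checkpoints carry extra metadata.

```python
# Back-of-the-envelope storage estimate: bytes = parameters x bytes per parameter.
# Assumption: weights are stored as float32 (4 bytes each).
BYTES_PER_PARAM = 4


def checkpoint_size_bytes(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the trained weights alone, ignoring file metadata."""
    return num_params * bytes_per_param


dreambooth_params = 865.9e6  # "# Params" quoted for DreamBooth on this page
textboost_params = 0.7e6     # "requiring only 0.7M parameters", quoted on this page

dreambooth_gib = checkpoint_size_bytes(dreambooth_params) / 1024**3
textboost_mib = checkpoint_size_bytes(textboost_params) / 1024**2
ratio = dreambooth_params / textboost_params

print(f"DreamBooth: {dreambooth_gib:.2f} GiB")
print(f"TextBoost:  {textboost_mib:.2f} MiB")
print(f"Parameter ratio: {ratio:.0f}x")
```

Under that float32 assumption the DreamBooth estimate lands close to the 3.3 GB per-concept storage listed for it, while the TextBoost estimate is on the order of a few megabytes.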

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reduced parameter count could make on-device personalization feasible on phones or laptops with limited memory.
The same selective update pattern might transfer to other conditioning signals such as depth maps or style references.
Lower storage per user profile would allow services to host many personalized models without proportional increases in disk use.
Faster convergence might shorten the time users wait between providing a reference image and receiving usable outputs.

Load-bearing premise

The causality-preserving adaptation and lightweight adapters can be added to the text encoder without introducing semantic drift or reducing the model's ability to follow complex prompts.

What would settle it

A controlled test in which TextBoost images show visibly lower fidelity to the reference subject or diverge from specific details in complex prompts compared with full-model fine-tuning baselines.

Figures

Figures reproduced from arXiv: 2409.08248 by Hyunjung Shim, Kunhee Kim, NaHyeon Park.

**Figure 2.** Method overview. We selectively fine-tune the text encoder for one-shot personalization. We utilize three novel techniques to further boost the personalization performance. To the best of our knowledge, our work is the first to focus exclusively on fine-tuning the text encoder for customized text-to-image generation. Furthermore, to enhance parameter efficiency, we adopt the Low-Rank Adaptation (LoR…

**Figure 3.** Qualitative comparison on Stable Diffusion v1.5. We compare images generated by each method using various types of text prompts on different subjects. All models are trained using a single reference image.

| Methods | CLIP-T ↑ | CLIP-I ↑ (seen) | CLIP-I ↑ (unseen) | # Params ↓ | GPU M ↓ (batch size=2) | Storage ↓ (per concept) |
|---|---|---|---|---|---|---|
| DreamBooth | 0.656 | 0.777 | 0.732 | 865.9 M | 35.7 GB | 3.3 GB |
| Custom Diffusion | 0.609 | 0.845 | 0.780 | 19.2 M | … | |

**Figure 4.** Diversity comparison. (a) We calculate the inter-similarity of 100 generated images using the DINOv2 score and plot the distribution, given the same reference image and identical prompts. Blue and red horizontal lines indicate the median and mean of each distribution, respectively. (b) Qualitative examples of each method, with two subjects, each with two images per prompt. Note that for a fair comparison, …
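
One plausible reading of the "inter-similarity" in the Figure 4 caption is the cosine similarity between feature vectors for every pair of generated images, summarised by mean and median. The sketch below assumes that reading. The function names and the toy 3-D vectors are hypothetical stand-ins; a real evaluation would use DINOv2 embeddings of the 100 generated images.

```python
import math
import statistics
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def inter_similarity(features):
    """Cosine similarity for every unordered pair of feature vectors."""
    return [cosine_similarity(u, v) for u, v in combinations(features, 2)]


# Toy 3-D stand-ins for image embeddings (hypothetical values, not DINOv2 outputs).
near_duplicates = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.98, 0.0, 0.1]]
varied = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]

for name, feats in [("near-duplicates", near_duplicates), ("varied", varied)]:
    scores = inter_similarity(feats)
    print(name,
          "mean:", round(statistics.mean(scores), 3),
          "median:", round(statistics.median(scores), 3))
```

A lower mean or median inter-similarity indicates a more varied set of outputs. With 100 images there are 4950 unordered pairs to score.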

**Figure 6.** Ablation on augmentation token. To test whether the augmentation token has learned the corresponding augmentation, we generate images with and without the augmentation token as the input prompt. We showcase a vertical flip as an example of intuitive visualization. … tokens were learned effectively, we performed an ablation specifically on the augmentation token. In this analysis, we used a vertical flip to…
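
The pairing described in the Figure 6 caption, an augmentation applied to the image together with a matching token in the prompt, can be sketched as follows. This is only an illustration of the general idea: the token name `<vflip>`, its placement at the start of the prompt, and the helper `make_training_pair` are all hypothetical and are not taken from the paper.

```python
def vertical_flip(image):
    """Flip an image given as a list of pixel rows (top row first)."""
    return image[::-1]


# Hypothetical mapping from augmentation tokens to image transforms.
AUGMENTATIONS = {"<vflip>": vertical_flip}


def make_training_pair(image, prompt, token=None):
    """Return (image, prompt), applying the augmentation and its token together.

    Without a token the image and prompt are returned unchanged, so the
    augmentation is only ever seen alongside its token.
    """
    if token is None:
        return image, prompt
    return AUGMENTATIONS[token](image), f"{token} {prompt}"


image = [[1, 2], [3, 4]]  # toy 2 x 2 "image"
print(make_training_pair(image, "photo of a dog"))
print(make_training_pair(image, "photo of a dog", token="<vflip>"))
```

If the token has absorbed the augmentation, prompts without it should yield upright images at generation time, which is the behaviour the ablation in the caption is probing.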

**Figure 7.** Stylization. We use a single style image (bottom left) as a reference to generate customized images. Conclusion: In this paper, we aimed to develop a high-quality personalized text-to-image generation method that enables creative control through text prompts, with a single reference image. Our TextBoost, which focuses on fine-tuning the text encoder with innovative training methods, effectively mitigates ov…

**Figure 8.** Examples of augmentation leaking.

**Figure 10.** B. Timestep Sampling. Effect of text embedding per timestep. Previous studies (Choi et al. 2022; Balaji et al. 2022) have shown that diffusion models can be formulated as a mixture-of-experts, with each timestep-conditioned U-Net playing a different role. Notably, Balaji et al. (2022) demonstrated that diffusion models become less reliant on text input as noise level decreases. Inspired by these findi…
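
The excerpt above is cut off before it states the sampling rule, and a passage quoted later on this page only names it as "SNR-weighted sampling". The sketch below therefore shows just the general shape such a scheme could take: it computes the signal-to-noise ratio for the linear DDPM schedule of Ho et al. (2020) and draws timesteps with a weight that favours noisier steps. The weighting `1 / (1 + SNR)` is an assumption made purely for illustration, not the paper's formula, and the schedule is not necessarily the one used by Stable Diffusion v1.5.

```python
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule with the default endpoints from Ho et al. (2020)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def snr(alpha_bars):
    """Signal-to-noise ratio alpha_bar_t / (1 - alpha_bar_t) per timestep."""
    return [a / (1.0 - a) for a in alpha_bars]


betas = linear_beta_schedule()
alpha_bars = cumulative_alpha(betas)
snrs = snr(alpha_bars)

# Hypothetical weighting (assumption): favour high-noise timesteps, where the
# excerpt suggests the text condition matters more. 1 / (1 + SNR) = 1 - alpha_bar.
weights = [1.0 / (1.0 + s) for s in snrs]

rng = random.Random(0)
sampled = rng.choices(range(len(betas)), weights=weights, k=8)
print("sampled timesteps:", sampled)
```

Any monotone function of the SNR could be substituted for the weight; the point is only that timesteps need not be drawn uniformly during fine-tuning.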

**Figure 10.** Augmentations used during training TextBoost.

**Figure 11.** Effect of text input. We measured the effect of the text conditioning by $\beta_t \lvert \epsilon(x_t, y_{\text{base}}) - \epsilon(x_t, y_{\text{other}}) \rvert$, where the base prompt $y_{\text{base}}$ is 'photo of a dog'. Here, $\beta_t$, $\alpha_t$, and $\bar{\alpha}_t$ are predefined time-dependent scaling factors; we can see that the model output's impact is scaled by $\beta_t / \sqrt{1 - \bar{\alpha}_t}$. Consequently, we scaled the difference by $\beta_t$. As illustrated in …
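
The scaling factor mentioned in the Figure 11 caption can be traced to the reverse-process mean, assuming the standard DDPM parameterization of Ho et al. (2020), which the reference list cites. The symbol $\mu_\theta$ and the explicit conditioning argument $y$ are notation added here for illustration and do not appear in the caption.

```latex
\mu_\theta(x_t, y) = \frac{1}{\sqrt{\alpha_t}}
  \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon(x_t, y) \right)

\mu_\theta(x_t, y_{\text{base}}) - \mu_\theta(x_t, y_{\text{other}})
  = -\frac{1}{\sqrt{\alpha_t}} \cdot \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}
    \left( \epsilon(x_t, y_{\text{base}}) - \epsilon(x_t, y_{\text{other}}) \right)
```

Under that assumption, swapping the prompt shifts the denoising mean by the difference in predicted noise multiplied by $\beta_t / \sqrt{1 - \bar{\alpha}_t}$ (and by $1/\sqrt{\alpha_t}$), which is the factor the caption points to.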

**Figure 12.** User study template. We ask 100 participants to answer 20 questions each on Amazon Mechanical Turk.

**Figure 13.** Comparison with Custom Diffusion. More qualitative results on comparison with Custom Diffusion. Random seeds are fixed for fair comparison.

**Figure 14.** More qualitative results of our TextBoost (dog).

**Figure 15.** More qualitative results of our TextBoost (cat).

**Figure 16.** More qualitative results of our TextBoost (several subjects).

Abstract

In this paper, we introduce TextBoost, an efficient one-shot personalization approach for text-to-image diffusion models. Traditional personalization methods typically involve fine-tuning extensive portions of the model, leading to substantial storage requirements and slow convergence. In contrast, we propose selectively fine-tuning only the text encoder, significantly improving computational and storage efficiency. To preserve the original semantic integrity, we develop a novel causality-preserving adaptation mechanism. Additionally, lightweight adapters are employed to locally refine text embeddings immediately before their interaction with cross-attention layers, greatly enhancing the expressiveness of text embeddings with minimal computational overhead. Empirical evaluations across diverse concepts demonstrate that TextBoost achieves faster convergence and substantially reduces storage demands by minimizing the number of trainable parameters. Furthermore, TextBoost maintains comparable subject fidelity, superior text fidelity, and greater generation diversity compared to existing methods. We show that our proposed method offers an efficient, scalable, and practically applicable solution for high-quality text-to-image personalization, particularly beneficial in resource-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TextBoost is an incremental efficiency tweak on text-encoder fine-tuning for one-shot diffusion personalization, but the abstract supplies no numbers to check the performance claims.

TextBoost is basically a way to personalize text-to-image models with less hassle by only adjusting the text encoder instead of larger parts of the diffusion model. They add a causality-preserving adaptation to keep the original meanings and use small adapters before the cross-attention layers to improve how text is handled. The new parts are the selective fine-tuning of the text encoder, that causality step, and the lightweight adapters placed right before cross-attention. This setup aims to lower the number of parameters that need training, which should speed things up and cut storage needs. The paper points out that this keeps subject fidelity similar while improving text following and variety in outputs. The approach makes sense for resource-limited cases. It builds on known techniques but applies them in a focused way to the text side.

The main issue is that the abstract talks about empirical results across concepts but gives no actual numbers, tables, or comparisons. We don't see error bars, exact datasets, or how it stacks up against specific baselines like DreamBooth or LoRA variants. This makes it tough to know if the claimed advantages in convergence speed and fidelity are solid. The causality-preserving part is meant to prevent problems with prompt adherence, but without details on how it's implemented or tested, it's an open question.

Readers who work on efficient fine-tuning for generative models would get the most out of this. It could be useful for someone trying to deploy personalization on edge devices or with limited GPUs. The paper shows clear thinking on the efficiency problem without contradictions in the description. It should go to peer review so reviewers can examine the experiments and see if the method delivers on the promises.

Referee Report

1 major / 0 minor

Summary. The paper introduces TextBoost, an efficient one-shot personalization approach for text-to-image diffusion models by selectively fine-tuning only the text encoder. It develops a causality-preserving adaptation mechanism and employs lightweight adapters to refine text embeddings before cross-attention, claiming faster convergence, reduced storage via fewer trainable parameters, comparable subject fidelity, superior text fidelity, and greater generation diversity compared to existing methods.

Significance. If the empirical claims hold, the method offers a practical efficiency gain for personalization tasks by minimizing parameter updates and storage overhead while addressing semantic integrity, which could make high-quality customization more feasible in resource-limited environments.

major comments (1)

[Abstract] The abstract reports empirical results across concepts but provides no quantitative tables, baselines, error bars, or dataset details; central performance claims cannot be verified from the given text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment. We address the point regarding the abstract below and will incorporate revisions in the next version of the manuscript.

Referee: [Abstract] The abstract reports empirical results across concepts but provides no quantitative tables, baselines, error bars, or dataset details; central performance claims cannot be verified from the given text.

Authors: We agree that the abstract, as currently written, summarizes empirical outcomes at a high level without including specific quantitative values, baselines, or dataset references, which limits verifiability from the abstract alone. While abstracts are inherently concise and full experimental details (including tables with metrics, baselines, error bars, and dataset descriptions) appear in Section 4 of the manuscript, we will revise the abstract to incorporate a small number of key quantitative highlights, such as approximate parameter reduction percentages, convergence speed improvements, and relative fidelity/diversity gains, drawn from the experimental results. This will better support the central claims without exceeding typical abstract length constraints.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method only

The paper introduces TextBoost as an empirical one-shot personalization technique that selectively fine-tunes the text encoder plus lightweight adapters, supported by a causality-preserving adaptation mechanism. All performance claims (faster convergence, reduced storage, comparable subject fidelity, superior text fidelity) are framed as outcomes of empirical evaluations across diverse concepts rather than any derivation, equation, or fitted prediction. No self-referential fitting, self-citation load-bearing premises, or reductions of predictions to inputs by construction appear in the described approach. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method relies on standard diffusion-model components and empirical tuning.

pith-pipeline@v0.9.0 · 5700 in / 997 out tokens · 21959 ms · 2026-05-23T20:47:57.861391+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose selectively fine-tuning only the text encoder... augmentation token... knowledge-preservation loss... SNR-weighted sampling"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our approach is memory and storage-efficient, requiring only 0.7M parameters"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adversarial Concept Distillation for One-Step Diffusion Personalization
cs.CV 2025-10 unverdicted novelty 6.0

OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[3] Alaluf, Y.; Richardson, E.; Metzer, G.; and Cohen-Or, D. 2023. A Neural Space-Time Representation for Text-to-Image Personalization. ACM Transactions on Graphics (TOG), 42.

[4] Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; Catanzaro, B.; Karras, T.; and Liu, M.-Y. 2022. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. ArXiv:2211.01324 [cs].

[5] Black, K.; Janner, M.; Du, Y.; Kostrikov, I.; and Levine, S. 2024. Training Diffusion Models with Reinforcement Learning.

[6] Brooks, T.; Holynski, A.; and Efros, A. A. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. In IEEE Conference on Computer Vision and Pattern Recognition.

[7] Chen, H.; Zhang, Y.; Wu, S.; Wang, X.; Duan, X.; Zhou, Y.; and Zhu, W. 2024a. DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation. In International Conference on Learning Representations.

[8] Chen, W.; Hu, H.; Li, Y.; Ruiz, N.; Jia, X.; Chang, M.-W.; and Cohen, W. W. 2024b. Subject-driven Text-to-Image Generation via Apprenticeship Learning. In Advances in Neural Information Processing Systems. ArXiv:2304.00186 [cs].

[9] Choi, J.; Lee, J.; Shin, C.; Kim, S.; Kim, H.; and Yoon, S. 2022. Perception Prioritized Training of Diffusion Models. In CVPR. ArXiv:2204.00227 [cs].

[10] Fan, Y.; Watkins, O.; Du, Y.; Liu, H.; Ryu, M.; Boutilier, C.; Abbeel, P.; Ghavamzadeh, M.; Lee, K.; and Lee, K. 2023. DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models.

[11] Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A. H.; Chechik, G.; and Cohen-Or, D. 2023. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In International Conference on Learning Representations.

[12] Gu, Y.; Wang, X.; Wu, J. Z.; Shi, Y.; Chen, Y.; Fan, Z.; Xiao, W.; Zhao, R.; Chang, S.; Wu, W.; Ge, Y.; Shan, Y.; and Shou, M. Z. 2023. Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models. In Advances in Neural Information Processing Systems.

[13] Han, L.; Li, Y.; Zhang, H.; Milanfar, P.; Metaxas, D.; and Yang, F. 2023. SVDiff: Compact Parameter Space for Diffusion Fine-Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14] He, X.; Cao, Z.; Kolkin, N.; Yu, L.; Wan, K.; Rhodin, H.; and Kalarot, R. 2023. A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization. ArXiv:2311.04315 [cs].

[15] Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems. ArXiv:2006.11239.

[16] Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models.

[17] Hua, M.; Liu, J.; Ding, F.; Liu, W.; Wu, J.; and He, Q. 2023. DreamTuner: Single Image is Enough for Subject-Driven Generation. ArXiv:2312.13691 [cs].

[18] Jun, H.; Child, R.; Chen, M.; Schulman, J.; Ramesh, A.; Radford, A.; and Sutskever, I. 2020. Distribution Augmentation for Generative Modeling. In Proceedings of the 37th International Conference on Machine Learning, 5006–5019. PMLR.

[19] Kang, M.; Zhang, J.; Zhang, J.; Wang, X.; Chen, Y.; Ma, Z.; and Huang, X. 2023. Alleviating Catastrophic Forgetting of Incremental Object Detection via Within-Class and Between-Class Knowledge Distillation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 18848–18858. Paris, France: IEEE. ISBN 9798350307184.

[20] Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; and Aila, T. 2020. Training Generative Adversarial Networks with Limited Data. In Advances in Neural Information Processing Systems.

[21] Kingma, D. P.; Salimans, T.; Poole, B.; and Ho, J. 2021. Variational Diffusion Models. In Advances in Neural Information Processing Systems. ArXiv:2107.00630 [cs, stat].

[22] Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; and Zhu, J.-Y. 2023. Multi-Concept Customization of Text-to-Image Diffusion. In IEEE Conference on Computer Vision and Pattern Recognition.

[23] Lee, J.; Cho, K.; and Kiela, D. 2019. Countering Language Drift via Visual Grounding. In Inui, K.; Jiang, J.; Ng, V.; and Wan, X., eds., Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4385–4395. Hong Kong, China: Associa...

[24] Li, D.; Li, J.; and Hoi, S. C. H. 2023. BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing. In Advances in Neural Information Processing Systems.

[25] Li, Y.; Zhang, R.; Lu, J. C.; and Shechtman, E. 2020. Few-shot Image Generation with Elastic Weight Consolidation. In Advances in Neural Information Processing Systems.

[26] Liu, Z.; Feng, R.; Zhu, K.; Zhang, Y.; Zheng, K.; Liu, Y.; Zhao, D.; Zhou, J.; and Cao, Y. 2023. Cones: Concept Neurons in Diffusion Models for Customized Generation. In Proceedings of the 40th International Conference on Machine Learning. PMLR.

[27] Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.

[28] Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; and Chen, M. 2023. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the 39th International Conference on Machine Learning. PMLR.

[29] Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; Assran, M.; Ballas, N.; Galuba, W.; Howes, R.; Huang, P.-Y.; Li, S.-W.; Misra, I.; Rabbat, M.; Sharma, V.; Synnaeve, G.; Xu, H.; Jegou, H.; Mairal, J.; Labatut, P.; Joulin, A.; and Bojanowski, P. 2024. DINOv2: Learning Robust V...

[30] Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; and Rombach, R. 2024. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In International Conference on Learning Representations.

[31] Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning. PMLR.

[32] Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; and Chen, M. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents.

[33] Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T. P.; and Wayne, G. 2019. Experience Replay for Continual Learning. In Advances in Neural Information Processing Systems. ArXiv:1811.11682 [cs, stat].

[34] Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In IEEE Conference on Computer Vision and Pattern Recognition.

[35] Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention.

[36] Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; and Aberman, K. 2023. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). ArXiv:2208.12242 [cs].

[37] Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S. K. S.; Ayan, B. K.; Mahdavi, S. S.; Lopes, R. G.; Salimans, T.; Ho, J.; Fleet, D. J.; and Norouzi, M. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems.

[38] Sauer, A.; Boesel, F.; Dockhorn, T.; Blattmann, A.; Esser, P.; and Rombach, R. 2024. Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. ArXiv:2403.12015 [cs].

[39] Shi, J.; Xiong, W.; Lin, Z.; and Jung, H. J. 2024. InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Sohl-Dickstein, J.; Weiss, E. A.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, 2256–2265. PMLR.

[41] Sohn, K.; Ruiz, N.; Lee, K.; Chin, D. C.; Blok, I.; Chang, H.; Barber, J.; Jiang, L.; Entis, G.; Li, Y.; Hao, Y.; Essa, I.; Rubinstein, M.; and Krishnan, D. 2023. StyleDrop: Text-to-Image Generation in Any Style. In Advances in Neural Information Processing Systems.

[42] Tewel, Y.; Gal, R.; Chechik, G.; and Atzmon, Y. 2023. Key-Locked Rank One Editing for Text-to-Image Personalization. ACM SIGGRAPH 2023 Conference Proceedings.

[43] Voynov, A.; Chu, Q.; Cohen-Or, D.; and Aberman, K. 2023. P+: Extended Textual Conditioning in Text-to-Image Generation. ArXiv:2303.09522 [cs].

[44] Wang, Z.; Wei, W.; Zhao, Y.; Xiao, Z.; Hasegawa-Johnson, M.; Shi, H.; and Hou, T. 2023. HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models. ArXiv:2312.00079 [cs].

[45] Wei, Y.; Zhang, Y.; Ji, Z.; Bai, J.; Zhang, L.; and Zuo, W. 2023. ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Xiao, G.; Yin, T.; Freeman, W. T.; Durand, F.; and Han, S. 2023. FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention. ArXiv:2305.10431 [cs].

[47] Zhang, X.; Wei, X.-Y.; Wu, J.; Zhang, T.; Zhang, Z.; Lei, Z.; and Li, Q. 2024a. Compositional Inversion for Stable Diffusion Models. In Proceedings of the AAAI Conference on Artificial Intelligence.

[48] Zhang, X.; Wei, X.-Y.; Zhang, W.; Wu, J.; Zhang, Z.; Lei, Z.; and Li, Q. 2024b. A Survey on Personalized Content Synthesis with Diffusion Models. ArXiv:2405.05538 [cs].

[49] Zhang, Y.; Yang, M.; Zhou, Q.; and Wang, Z. 2024c. Attention Calibration for Disentangled Text-to-Image Personalization. In IEEE Conference on Computer Vision and Pattern Recognition.

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Alaluf, Y.; Richardson, E.; Metzer, G.; and Cohen-Or, D. 2023. A Neural Space - Time Representation for Text -to- Image Personalization . ACM Transactions on Graphics (TOG), 42

work page 2023

[4] [4]

Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; Catanzaro, B.; Karras, T.; and Liu, M.-Y. 2022. eDiff - I : Text -to- Image Diffusion Models with an Ensemble of Expert Denoisers . ArXiv:2211.01324 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] [5]

Black, K.; Janner, M.; Du, Y.; Kostrikov, I.; and Levine, S. 2024. Training Diffusion Models with Reinforcement Learning

work page 2024

[6] [6]

Brooks, T.; Holynski, A.; and Efros, A. A. 2023. InstructPix2Pix : Learning to Follow Image Editing Instructions . In IEEE Conference on Computer Vision and Pattern Recognition

work page 2023

Chen, H.; Zhang, Y.; Wu, S.; Wang, X.; Duan, X.; Zhou, Y.; and Zhu, W. 2024a. DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation. In International Conference on Learning Representations.

Chen, W.; Hu, H.; Li, Y.; Ruiz, N.; Jia, X.; Chang, M.-W.; and Cohen, W. W. 2024b. Subject-driven Text-to-Image Generation via Apprenticeship Learning. In Advances in Neural Information Processing Systems. ArXiv:2304.00186 [cs].

Choi, J.; Lee, J.; Shin, C.; Kim, S.; Kim, H.; and Yoon, S. 2022. Perception Prioritized Training of Diffusion Models. In CVPR. ArXiv:2204.00227 [cs].

Fan, Y.; Watkins, O.; Du, Y.; Liu, H.; Ryu, M.; Boutilier, C.; Abbeel, P.; Ghavamzadeh, M.; Lee, K.; and Lee, K. 2023. DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models.

Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A. H.; Chechik, G.; and Cohen-Or, D. 2023. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In International Conference on Learning Representations.

Gu, Y.; Wang, X.; Wu, J. Z.; Shi, Y.; Chen, Y.; Fan, Z.; Xiao, W.; Zhao, R.; Chang, S.; Wu, W.; Ge, Y.; Shan, Y.; and Shou, M. Z. 2023. Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models. In Advances in Neural Information Processing Systems.

Han, L.; Li, Y.; Zhang, H.; Milanfar, P.; Metaxas, D.; and Yang, F. 2023. SVDiff: Compact Parameter Space for Diffusion Fine-Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

He, X.; Cao, Z.; Kolkin, N.; Yu, L.; Wan, K.; Rhodin, H.; and Kalarot, R. 2023. A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization. ArXiv:2311.04315 [cs].

Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems. ArXiv:2006.11239.

Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models.

Hua, M.; Liu, J.; Ding, F.; Liu, W.; Wu, J.; and He, Q. 2023. DreamTuner: Single Image is Enough for Subject-Driven Generation. ArXiv:2312.13691 [cs].

Jun, H.; Child, R.; Chen, M.; Schulman, J.; Ramesh, A.; Radford, A.; and Sutskever, I. 2020. Distribution Augmentation for Generative Modeling. In Proceedings of the 37th International Conference on Machine Learning, 5006–5019. PMLR.

Kang, M.; Zhang, J.; Zhang, J.; Wang, X.; Chen, Y.; Ma, Z.; and Huang, X. 2023. Alleviating Catastrophic Forgetting of Incremental Object Detection via Within-Class and Between-Class Knowledge Distillation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 18848–18858. Paris, France: IEEE. ISBN 9798350307184.

Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; and Aila, T. 2020. Training Generative Adversarial Networks with Limited Data. In Advances in Neural Information Processing Systems.

Kingma, D. P.; Salimans, T.; Poole, B.; and Ho, J. 2021. Variational Diffusion Models. In Advances in Neural Information Processing Systems. ArXiv:2107.00630 [cs, stat].

Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; and Zhu, J.-Y. 2023. Multi-Concept Customization of Text-to-Image Diffusion. In IEEE Conference on Computer Vision and Pattern Recognition.

Lee, J.; Cho, K.; and Kiela, D. 2019. Countering Language Drift via Visual Grounding. In Inui, K.; Jiang, J.; Ng, V.; and Wan, X., eds., Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4385–4395. Hong Kong, China: Associa...

Li, D.; Li, J.; and Hoi, S. C. H. 2023. BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing. In Advances in Neural Information Processing Systems.

Li, Y.; Zhang, R.; Lu, J. C.; and Shechtman, E. 2020. Few-shot Image Generation with Elastic Weight Consolidation. In Advances in Neural Information Processing Systems.

Liu, Z.; Feng, R.; Zhu, K.; Zhang, Y.; Zheng, K.; Liu, Y.; Zhao, D.; Zhou, J.; and Cao, Y. 2023. Cones: Concept Neurons in Diffusion Models for Customized Generation. In Proceedings of the 40th International Conference on Machine Learning. PMLR.

Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.

Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; and Chen, M. 2023. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the 39th International Conference on Machine Learning. PMLR.

Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; Assran, M.; Ballas, N.; Galuba, W.; Howes, R.; Huang, P.-Y.; Li, S.-W.; Misra, I.; Rabbat, M.; Sharma, V.; Synnaeve, G.; Xu, H.; Jegou, H.; Mairal, J.; Labatut, P.; Joulin, A.; and Bojanowski, P. 2024. DINOv2: Learning Robust V...

Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; and Rombach, R. 2024. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In International Conference on Learning Representations.

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning. PMLR.

Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; and Chen, M. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents.

Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T. P.; and Wayne, G. 2019. Experience Replay for Continual Learning. In Advances in Neural Information Processing Systems. arXiv. ArXiv:1811.11682 [cs, stat].

Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In IEEE Conference on Computer Vision and Pattern Recognition.

Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention.

Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; and Aberman, K. 2023. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). ArXiv:2208.12242 [cs].

Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S. K. S.; Ayan, B. K.; Mahdavi, S. S.; Lopes, R. G.; Salimans, T.; Ho, J.; Fleet, D. J.; and Norouzi, M. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems.

Sauer, A.; Boesel, F.; Dockhorn, T.; Blattmann, A.; Esser, P.; and Rombach, R. 2024. Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. ArXiv:2403.12015 [cs].

Shi, J.; Xiong, W.; Lin, Z.; and Jung, H. J. 2024. InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Sohl-Dickstein, J.; Weiss, E. A.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, 2256–2265. PMLR.

Sohn, K.; Ruiz, N.; Lee, K.; Chin, D. C.; Blok, I.; Chang, H.; Barber, J.; Jiang, L.; Entis, G.; Li, Y.; Hao, Y.; Essa, I.; Rubinstein, M.; and Krishnan, D. 2023. StyleDrop: Text-to-Image Generation in Any Style. In Advances in Neural Information Processing Systems.

Tewel, Y.; Gal, R.; Chechik, G.; and Atzmon, Y. 2023. Key-Locked Rank One Editing for Text-to-Image Personalization. ACM SIGGRAPH 2023 Conference Proceedings.

Voynov, A.; Chu, Q.; Cohen-Or, D.; and Aberman, K. 2023. P+: Extended Textual Conditioning in Text-to-Image Generation. ArXiv:2303.09522 [cs].

Wang, Z.; Wei, W.; Zhao, Y.; Xiao, Z.; Hasegawa-Johnson, M.; Shi, H.; and Hou, T. 2023. HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models. ArXiv:2312.00079 [cs].

Wei, Y.; Zhang, Y.; Ji, Z.; Bai, J.; Zhang, L.; and Zuo, W. 2023. ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Xiao, G.; Yin, T.; Freeman, W. T.; Durand, F.; and Han, S. 2023. FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention. ArXiv:2305.10431 [cs].

Zhang, X.; Wei, X.-Y.; Wu, J.; Zhang, T.; Zhang, Z.; Lei, Z.; and Li, Q. 2024a. Compositional Inversion for Stable Diffusion Models. In Proceedings of the AAAI Conference on Artificial Intelligence.

Zhang, X.; Wei, X.-Y.; Zhang, W.; Wu, J.; Zhang, Z.; Lei, Z.; and Li, Q. 2024b. A Survey on Personalized Content Synthesis with Diffusion Models. ArXiv:2405.05538 [cs].

Zhang, Y.; Yang, M.; Zhou, Q.; and Wang, Z. 2024c. Attention Calibration for Disentangled Text-to-Image Personalization. In IEEE Conference on Computer Vision and Pattern Recognition.