arxiv: 2409.08248 · v2 · pith:G5N4F7VRnew · submitted 2024-09-12 · 💻 cs.CV

TextBoost: Boosting Text Encoder for Personalized Text-to-Image Generation

Pith reviewed 2026-05-23 20:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords personalized text-to-image generation · text encoder fine-tuning · diffusion models · one-shot personalization · lightweight adapters · causality-preserving adaptation · efficient adaptation

The pith

TextBoost personalizes text-to-image models by fine-tuning only the text encoder with lightweight adapters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TextBoost as a way to adapt text-to-image diffusion models to new subjects using far less computation and storage than standard methods. Instead of updating most of the model, it updates only the text encoder while adding a mechanism that keeps original word meanings intact and small adapters that sharpen text signals right before they shape the image. This setup yields quicker training runs and much smaller stored models. Results match existing methods on how closely images match the reference subject yet improve on how well they follow the text prompt and how much variety they produce. A reader would care because full-model personalization often demands too much memory and time to be practical outside large labs.

Core claim

TextBoost is an efficient one-shot personalization approach for text-to-image diffusion models that selectively fine-tunes only the text encoder. A causality-preserving adaptation mechanism maintains the original semantic integrity of the encoder, while lightweight adapters locally refine text embeddings immediately before they reach the cross-attention layers. This design delivers faster convergence, substantially lower storage needs through fewer trainable parameters, comparable subject fidelity, superior text fidelity, and greater generation diversity relative to prior personalization techniques.
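The Figure 2 excerpt on this page mentions adopting Low-Rank Adaptation (LoRA) for parameter efficiency. As a rough illustration of what a lightweight adapter on a frozen layer looks like, here is a minimal pure-Python sketch of a generic LoRA-style linear layer. It is not the authors' implementation: the class name `LoRALinear`, the helper `matvec`, the 4x4 identity weight, and the rank and scaling values are all hypothetical choices made for this example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Generic LoRA-style layer for illustration only (hypothetical names).
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during personalization
        # A starts small and random, B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)               # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]


# Toy usage: a 4x4 identity weight with a rank-2 adapter (assumed sizes).
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=2)
print(layer.forward([1.0, 2.0, 3.0, 4.0]))
```

Because `B` is initialized to zero, the adapted layer returns the same output as the frozen layer until training changes `A` and `B`; only those two small matrices would need to be stored per concept.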

What carries the argument

Causality-preserving adaptation mechanism plus lightweight adapters applied directly to the text encoder, which enables selective fine-tuning while preserving semantics and boosting expressiveness with minimal added cost.

If this is right

  • Personalization training converges faster than methods that update larger portions of the model.
  • Storage requirements drop sharply because only a small number of parameters are trained and saved.
  • Subject fidelity remains comparable to heavier personalization baselines.
  • Text fidelity improves over existing approaches, allowing generated images to match prompts more accurately.
  • Output diversity increases, producing more varied images for the same subject and prompt.
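The storage point above comes down to simple arithmetic: a low-rank update to a `d_in x d_out` matrix stores `rank * (d_in + d_out)` numbers instead of `d_in * d_out`. The sketch below works through that count. The width of 768, the rank of 4, and the 4 bytes per parameter are assumptions chosen for illustration; the page does not state which layers TextBoost adapts or at what rank.

```python
def full_params(d_in, d_out):
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Parameters in a rank-r update B (d_out x r) @ A (r x d_in)."""
    return rank * (d_in + d_out)


def storage_bytes(n_params, bytes_per_param=4):
    """Storage for n_params values; 4 bytes assumes 32-bit floats."""
    return n_params * bytes_per_param


D = 768   # assumed layer width, for illustration only
RANK = 4  # assumed adapter rank, for illustration only

full = full_params(D, D)
low = lora_params(D, D, RANK)
print(full, low, full // low)   # dense count, low-rank count, reduction factor
print(storage_bytes(low))       # bytes needed for one adapted matrix
```

Under these assumed sizes the low-rank update holds 6,144 values against 589,824 for the dense matrix, a 96-fold reduction for that one matrix. The same helper applied to the 865.9 M trainable parameters listed for DreamBooth in the Figure 3 excerpt gives roughly 3.5 GB at 4 bytes each, the same order as the 3.3 GB storage figure quoted there.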

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reduced parameter count could make on-device personalization feasible on phones or laptops with limited memory.
  • The same selective update pattern might transfer to other conditioning signals such as depth maps or style references.
  • Lower storage per user profile would allow services to host many personalized models without proportional increases in disk use.
  • Faster convergence might shorten the time users wait between providing a reference image and receiving usable outputs.

Load-bearing premise

The causality-preserving adaptation and lightweight adapters can be added to the text encoder without introducing semantic drift or reducing the model's ability to follow complex prompts.

What would settle it

A controlled test in which TextBoost images show visibly lower fidelity to the reference subject or diverge from specific details in complex prompts compared with full-model fine-tuning baselines.

Figures

Figures reproduced from arXiv: 2409.08248 by Hyunjung Shim, Kunhee Kim, NaHyeon Park.

Figure 2
Figure 2: Method overview. We selectively fine-tune text encoder for one-shot personalization. We utilize three novel techniques to further boost the personalization performance. … adaptation. To the best of our knowledge, our work is the first to focus exclusively on fine-tuning the text encoder for customized text-to-image generation. Furthermore, to enhance parameter efficiency, we adopt the Low-Rank Adaptation (LoR…
Figure 3
Figure 3: Qualitative comparison on Stable Diffusion v1.5. We compare images generated by each method using various types of text prompts on different subjects. All models are trained using a single reference image.

Methods | CLIP-T ↑ | CLIP-I ↑ (seen) | CLIP-I ↑ (unseen) | # Params ↓ | GPU M ↓ (batch size=2) | Storage ↓ (per concept)
DreamBooth | 0.656 | 0.777 | 0.732 | 865.9 M | 35.7 GB | 3.3 GB
Custom Diffusion | 0.609 | 0.845 | 0.780 | 19.2 M | …
Figure 4
Figure 4: Diversity comparison. (a) We calculate the inter-similarity of 100 generated images using the DINOv2 score and plot the distribution, given the same reference image and identical prompts. Blue and red horizontal lines indicate the median and mean of each distribution, respectively. (b) Qualitative examples of each method, with two subjects, each with two images per prompt. Note that for a fair comparison, …
Figure 6
Figure 6: Ablation on augmentation token. To test whether the augmentation token has learned the corresponding augmentation, we generate images with and without the augmentation token as the input prompt. We showcase a vertical flip as an example of intuitive visualization. … tokens were learned effectively, we performed an ablation specifically on the augmentation token. In this analysis, we used a vertical flip to…
Figure 7
Figure 7: Stylization. We use a single style image (bottom left) as a reference to generate customized images. Conclusion: In this paper, we aimed to develop high-quality personalized text-to-image generation method that enables creative control through text prompts, with a single reference image. Our TextBoost, which focuses on fine-tuning the text encoder with innovative training methods, effectively mitigates ov…
Figure 8
Figure 8: Examples of Augmentation leaking
Figure 10
Figure 10: B. Timestep Sampling. Effect of text embedding per timestep. Previous studies (Choi et al. 2022; Balaji et al. 2022) have shown that diffusion models can be formulated as a mixture-of-experts, with each timestep conditioned U-Net playing a different role. Notably, Balaji et al. (2022) demonstrated that diffusion models become less reliant on text input as noise level decreases. Inspired by these findi…
Figure 10
Figure 10: Augmentations used during training TextBoost
Figure 11
Figure 11: Effect of text input. We measured the effect of the text conditioning by β_t |ϵ(x_t, y_base) − ϵ(x_t, y_other)|, where a base prompt y_base is ‘photo of a dog’. … Here, β_t, α_t, and ᾱ_t are predefined time-dependent scaling factors; we can see that the model output’s impact is scaled by β_t / √(1 − ᾱ_t). Consequently, we scaled the difference by β_t. As illustrated in …
Figure 12
Figure 12: User study template. We ask 100 participants to answer 20 questions each on Amazon Mechanical Turk
Figure 13
Figure 13: Comparison with Custom Diffusion. More qualitative results on comparison with Custom Diffusion. Random seeds are fixed for fair comparison
Figure 14
Figure 14: More qualitative results of our TextBoost (dog)
Figure 15
Figure 15: More qualitative results of our TextBoost (cat)
Figure 16
Figure 16: More qualitative results of our TextBoost (several subjects)
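The diversity comparison in Figure 4 is described as an inter-similarity over 100 generated images using a DINOv2 score. One plausible reading, assumed here rather than confirmed by this page, is the mean pairwise cosine similarity between image feature vectors, where a lower value indicates more varied outputs. The sketch below computes that quantity on toy vectors; the function names are hypothetical and feature extraction with DINOv2 is deliberately left out.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def inter_similarity(features):
    """Mean cosine similarity over all unordered pairs of feature vectors.

    Assumed definition for illustration; needs at least two vectors.
    """
    pairs = list(combinations(features, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy stand-ins for image features (real ones would come from an encoder).
near_duplicates = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
varied = [[1.0, 0.0], [0.0, 1.0]]
print(inter_similarity(near_duplicates), inter_similarity(varied))
```

A set of near-identical generations scores close to 1, while mutually dissimilar generations score close to 0, which is the direction a "greater diversity" claim would need to show.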
Original abstract

In this paper, we introduce TextBoost, an efficient one-shot personalization approach for text-to-image diffusion models. Traditional personalization methods typically involve fine-tuning extensive portions of the model, leading to substantial storage requirements and slow convergence. In contrast, we propose selectively fine-tuning only the text encoder, significantly improving computational and storage efficiency. To preserve the original semantic integrity, we develop a novel causality-preserving adaptation mechanism. Additionally, lightweight adapters are employed to locally refine text embeddings immediately before their interaction with cross-attention layers, greatly enhancing the expressiveness of text embeddings with minimal computational overhead. Empirical evaluations across diverse concepts demonstrate that TextBoost achieves faster convergence and substantially reduces storage demands by minimizing the number of trainable parameters. Furthermore, TextBoost maintains comparable subject fidelity, superior text fidelity, and greater generation diversity compared to existing methods. We show that our proposed method offers an efficient, scalable, and practically applicable solution for high-quality text-to-image personalization, particularly beneficial in resource-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces TextBoost, an efficient one-shot personalization approach for text-to-image diffusion models by selectively fine-tuning only the text encoder. It develops a causality-preserving adaptation mechanism and employs lightweight adapters to refine text embeddings before cross-attention, claiming faster convergence, reduced storage via fewer trainable parameters, comparable subject fidelity, superior text fidelity, and greater generation diversity compared to existing methods.

Significance. If the empirical claims hold, the method offers a practical efficiency gain for personalization tasks by minimizing parameter updates and storage overhead while addressing semantic integrity, which could make high-quality customization more feasible in resource-limited environments.

major comments (1)
  1. [Abstract] The abstract reports empirical results across concepts but provides no quantitative tables, baselines, error bars, or dataset details; central performance claims cannot be verified from the given text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment. We address the point regarding the abstract below and will incorporate revisions in the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The abstract reports empirical results across concepts but provides no quantitative tables, baselines, error bars, or dataset details; central performance claims cannot be verified from the given text.

    Authors: We agree that the abstract, as currently written, summarizes empirical outcomes at a high level without specific quantitative values, baselines, or dataset references, which limits verifiability from the abstract alone. Abstracts are inherently concise, and the full experimental details (tables with metrics, baselines, error bars, and dataset descriptions) appear in Section 4 of the manuscript. We will nevertheless revise the abstract to include a small number of key quantitative highlights drawn from the experimental results, such as approximate parameter reduction percentages, convergence speed improvements, and relative fidelity and diversity gains. This will better support the central claims without exceeding typical abstract length constraints.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method only

Full rationale

The paper introduces TextBoost as an empirical one-shot personalization technique that selectively fine-tunes the text encoder plus lightweight adapters, supported by a causality-preserving adaptation mechanism. All performance claims (faster convergence, reduced storage, comparable subject fidelity, superior text fidelity) are framed as outcomes of empirical evaluations across diverse concepts rather than any derivation, equation, or fitted prediction. No self-referential fitting, self-citation load-bearing premises, or reductions of predictions to inputs by construction appear in the described approach. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method relies on standard diffusion-model components and empirical tuning.

pith-pipeline@v0.9.0 · 5700 in / 997 out tokens · 21959 ms · 2026-05-23T20:47:57.861391+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adversarial Concept Distillation for One-Step Diffusion Personalization

    cs.CV 2025-10 unverdicted novelty 6.0

    OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  3. [3]

    Alaluf, Y.; Richardson, E.; Metzer, G.; and Cohen-Or, D. 2023. A Neural Space-Time Representation for Text-to-Image Personalization. ACM Transactions on Graphics (TOG), 42

  4. [4]

    Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; Catanzaro, B.; Karras, T.; and Liu, M.-Y. 2022. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv:2211.01324 [cs]

  5. [5]

    Black, K.; Janner, M.; Du, Y.; Kostrikov, I.; and Levine, S. 2024. Training Diffusion Models with Reinforcement Learning

  6. [6]

    Brooks, T.; Holynski, A.; and Efros, A. A. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. In IEEE Conference on Computer Vision and Pattern Recognition

  7. [7]

    Chen, H.; Zhang, Y.; Wu, S.; Wang, X.; Duan, X.; Zhou, Y.; and Zhu, W. 2024a. DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation. In International Conference on Learning Representations

  8. [8]

    Chen, W.; Hu, H.; Li, Y.; Ruiz, N.; Jia, X.; Chang, M.-W.; and Cohen, W. W. 2024b. Subject-driven Text-to-Image Generation via Apprenticeship Learning. In Advances in Neural Information Processing Systems. arXiv:2304.00186 [cs]

  9. [9]

    Choi, J.; Lee, J.; Shin, C.; Kim, S.; Kim, H.; and Yoon, S. 2022. Perception Prioritized Training of Diffusion Models. In CVPR. arXiv:2204.00227 [cs]

  10. [10]

    Fan, Y.; Watkins, O.; Du, Y.; Liu, H.; Ryu, M.; Boutilier, C.; Abbeel, P.; Ghavamzadeh, M.; Lee, K.; and Lee, K. 2023. DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

  11. [11]

    Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A. H.; Chechik, G.; and Cohen-Or, D. 2023. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In International Conference on Learning Representations

  12. [12]

    Gu, Y.; Wang, X.; Wu, J. Z.; Shi, Y.; Chen, Y.; Fan, Z.; Xiao, W.; Zhao, R.; Chang, S.; Wu, W.; Ge, Y.; Shan, Y.; and Shou, M. Z. 2023. Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models. In Advances in Neural Information Processing Systems

  13. [13]

    Han, L.; Li, Y.; Zhang, H.; Milanfar, P.; Metaxas, D.; and Yang, F. 2023. SVDiff: Compact Parameter Space for Diffusion Fine-Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  14. [14]

    He, X.; Cao, Z.; Kolkin, N.; Yu, L.; Wan, K.; Rhodin, H.; and Kalarot, R. 2023. A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization. arXiv:2311.04315 [cs]

  15. [15]

    Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems. arXiv:2006.11239

  16. [16]

    Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models

  17. [17]

    Hua, M.; Liu, J.; Ding, F.; Liu, W.; Wu, J.; and He, Q. 2023. DreamTuner: Single Image is Enough for Subject-Driven Generation. arXiv:2312.13691 [cs]

  18. [18]

    Jun, H.; Child, R.; Chen, M.; Schulman, J.; Ramesh, A.; Radford, A.; and Sutskever, I. 2020. Distribution Augmentation for Generative Modeling. In Proceedings of the 37th International Conference on Machine Learning, 5006–5019. PMLR

  19. [19]

    Kang, M.; Zhang, J.; Zhang, J.; Wang, X.; Chen, Y.; Ma, Z.; and Huang, X. 2023. Alleviating Catastrophic Forgetting of Incremental Object Detection via Within-Class and Between-Class Knowledge Distillation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 18848–18858. Paris, France: IEEE. ISBN 9798350307184

  20. [20]

    Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; and Aila, T. 2020. Training Generative Adversarial Networks with Limited Data. In Advances in Neural Information Processing Systems

  21. [21]

    Kingma, D. P.; Salimans, T.; Poole, B.; and Ho, J. 2021. Variational Diffusion Models. In Advances in Neural Information Processing Systems. arXiv:2107.00630 [cs, stat]

  22. [22]

    Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; and Zhu, J.-Y. 2023. Multi-Concept Customization of Text-to-Image Diffusion. In IEEE Conference on Computer Vision and Pattern Recognition

  23. [23]

    Lee, J.; Cho, K.; and Kiela, D. 2019. Countering Language Drift via Visual Grounding. In Inui, K.; Jiang, J.; Ng, V.; and Wan, X., eds., Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4385–4395. Hong Kong, China: Associa...

  24. [24]

    Li, D.; Li, J.; and Hoi, S. C. H. 2023. BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing. In Advances in Neural Information Processing Systems

  25. [25]

    Li, Y.; Zhang, R.; Lu, J. C.; and Shechtman, E. 2020. Few-shot Image Generation with Elastic Weight Consolidation. In Advances in Neural Information Processing Systems

  26. [26]

    Liu, Z.; Feng, R.; Zhu, K.; Zhang, Y.; Zheng, K.; Liu, Y.; Zhao, D.; Zhou, J.; and Cao, Y. 2023. Cones: Concept Neurons in Diffusion Models for Customized Generation. In Proceedings of the 40th International Conference on Machine Learning. PMLR

  27. [27]

    Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations

  28. [28]

    Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; and Chen, M. 2023. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the 39th International Conference on Machine Learning. PMLR

  29. [29]

    Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; Assran, M.; Ballas, N.; Galuba, W.; Howes, R.; Huang, P.-Y.; Li, S.-W.; Misra, I.; Rabbat, M.; Sharma, V.; Synnaeve, G.; Xu, H.; Jegou, H.; Mairal, J.; Labatut, P.; Joulin, A.; and Bojanowski, P. 2024. DINOv2: Learning Robust V...

  30. [30]

    Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; and Rombach, R. 2024. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In International Conference on Learning Representations

  31. [31]

    Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning. PMLR

  32. [32]

    Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; and Chen, M. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents

  33. [33]

    Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T. P.; and Wayne, G. 2019. Experience Replay for Continual Learning. In Advances in Neural Information Processing Systems. arXiv:1811.11682 [cs, stat]

  34. [34]

    Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In IEEE Conference on Computer Vision and Pattern Recognition

  35. [35]

    Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention

  36. [36]

    Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; and Aberman, K. 2023. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2208.12242 [cs]

  37. [37]

    Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S. K. S.; Ayan, B. K.; Mahdavi, S. S.; Lopes, R. G.; Salimans, T.; Ho, J.; Fleet, D. J.; and Norouzi, M. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems

  38. [38]

    Sauer, A.; Boesel, F.; Dockhorn, T.; Blattmann, A.; Esser, P.; and Rombach, R. 2024. Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. arXiv:2403.12015 [cs]

  39. [39]

    Shi, J.; Xiong, W.; Lin, Z.; and Jung, H. J. 2024. InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  40. [40]

    Sohl-Dickstein, J.; Weiss, E. A.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, 2256–2265. PMLR

  41. [41]

    Sohn, K.; Ruiz, N.; Lee, K.; Chin, D. C.; Blok, I.; Chang, H.; Barber, J.; Jiang, L.; Entis, G.; Li, Y.; Hao, Y.; Essa, I.; Rubinstein, M.; and Krishnan, D. 2023. StyleDrop: Text-to-Image Generation in Any Style. In Advances in Neural Information Processing Systems

  42. [42]

    Tewel, Y.; Gal, R.; Chechik, G.; and Atzmon, Y. 2023. Key-Locked Rank One Editing for Text-to-Image Personalization. ACM SIGGRAPH 2023 Conference Proceedings

  43. [43]

    Voynov, A.; Chu, Q.; Cohen-Or, D.; and Aberman, K. 2023. P+: Extended Textual Conditioning in Text-to-Image Generation. arXiv:2303.09522 [cs]

  44. [44]

    Wang, Z.; Wei, W.; Zhao, Y.; Xiao, Z.; Hasegawa-Johnson, M.; Shi, H.; and Hou, T. 2023. HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models. arXiv:2312.00079 [cs]

  45. [45]

    Wei, Y.; Zhang, Y.; Ji, Z.; Bai, J.; Zhang, L.; and Zuo, W. 2023. ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  46. [46]

    Xiao, G.; Yin, T.; Freeman, W. T.; Durand, F.; and Han, S. 2023. FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention. arXiv:2305.10431 [cs]

  47. [47]

    Zhang, X.; Wei, X.-Y.; Wu, J.; Zhang, T.; Zhang, Z.; Lei, Z.; and Li, Q. 2024a. Compositional Inversion for Stable Diffusion Models. In Proceedings of the AAAI Conference on Artificial Intelligence

  48. [48]

    Zhang, X.; Wei, X.-Y.; Zhang, W.; Wu, J.; Zhang, Z.; Lei, Z.; and Li, Q. 2024b. A Survey on Personalized Content Synthesis with Diffusion Models. arXiv:2405.05538 [cs]

  49. [49]

    Zhang, Y.; Yang, M.; Zhou, Q.; and Wang, Z. 2024c. Attention Calibration for Disentangled Text-to-Image Personalization. In IEEE Conference on Computer Vision and Pattern Recognition