pith. machine review for the scientific record.

arxiv: 2605.07257 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

Adaptive Subspace Projection for Generative Personalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic collapsing problem · generative personalization · subspace projection · test-time embedding adjustment · text-to-image generation · prompt fidelity · personalized concepts

The pith

Semantic drift in generative personalization concentrates in a low-dimensional subspace, enabling a training-free projection method to restore prompt fidelity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the semantic collapsing problem occurs because personalization induces a semantic drift confined to a particular low-dimensional subspace rather than spreading randomly, while also making the original base embedding unstable. It introduces AdaptSP, a training-free adjustment that anchors to the stable pre-trained embedding, isolates the drift, and projects it onto the identified subspace. This produces a targeted correction that reduces the personalized concept's dominance over prompt context. A sympathetic reader would care because the approach improves how well generated images follow detailed instructions while keeping the learned subject intact, all without any additional training steps. Experiments demonstrate gains in prompt fidelity and contextual alignment.

Core claim

Analysis of the personalization process reveals that the semantic drift causing SCP is concentrated within a specific low-dimensional subspace and that the embedding of the original base concept becomes perturbed and unstable as a reference. AdaptSP addresses this by using the pre-trained embedding as a stable anchor, isolating the drift component, and projecting it onto the identified subspace to perform a precise adjustment that mitigates semantic collapsing while preserving subject identity.
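
The page surfaces no pseudocode for this step, so here is a minimal numpy sketch of the adjustment as the abstract describes it: anchor to the pre-trained embedding, isolate the drift, project it onto the subspace. The function name, the basis U, and the strength parameter are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def adaptsp_adjust(e_personalized, e_pretrained, U, strength=1.0):
    """Sketch of an AdaptSP-style test-time adjustment (names assumed).

    e_personalized : (d,) concept embedding after personalization
    e_pretrained   : (d,) frozen base-concept embedding, the stable anchor
    U              : (d, k) orthonormal basis of the identified drift subspace
    """
    drift = e_personalized - e_pretrained           # isolate the semantic drift
    in_subspace = U @ (U.T @ drift)                 # project onto the subspace
    return e_personalized - strength * in_subspace  # targeted correction

# Toy check: plant a rank-2 drift inside a 768-dim embedding space.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(768, 2)))      # orthonormal (768, 2) basis
e_base = rng.normal(size=768)
e_pers = e_base + U @ np.array([3.0, -1.5]) + 0.01 * rng.normal(size=768)
e_adj = adaptsp_adjust(e_pers, e_base, U)
print(np.linalg.norm(e_pers - e_base), np.linalg.norm(e_adj - e_base))
```

The second norm collapses toward the residual noise level, which is the intuition behind the claim: an in-subspace correction removes the drift while leaving out-of-subspace (identity-bearing) directions untouched.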

What carries the argument

Adaptive Subspace Projection (AdaptSP): the test-time mechanism that identifies the low-dimensional subspace containing semantic drift and projects the perturbation vector onto it using the pre-trained embedding as anchor.
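
How the subspace is identified is described only verbally. One standard realization, consistent with the principal-component analyses the paper shows in Figures 3 and 4 but strictly our assumption, is PCA over drift vectors with variance-based rank selection:

```python
import numpy as np

def drift_subspace(personalized, pretrained, var_threshold=0.9):
    """Estimate the drift subspace from (n, d) stacks of embeddings.

    Rank is chosen adaptively: keep the fewest principal components whose
    cumulative explained variance reaches `var_threshold`.
    """
    drifts = personalized - pretrained
    drifts = drifts - drifts.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(drifts, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(explained, var_threshold)) + 1
    return vt[:k].T          # (d, k) orthonormal basis of the drift subspace
```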

If this is right

  • Personalized models achieve higher adherence to full text prompts without retraining.
  • Contextual details in prompts are respected while the learned subject remains recognizable.
  • The adjustment operates at test time on any already-personalized embedding.
  • Prompt fidelity improves across varied text instructions that combine the subject with other elements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If subspace identification proves consistent across different personalization techniques, the method could serve as a standard post-processing step for many embedding-based generators.
  • The same anchoring-plus-projection logic might apply to drift issues in other modalities such as video or audio generation.
  • Further tests with prompts that vary in complexity could clarify the dimensional limits of the drift subspace.

Load-bearing premise

The semantic drift is concentrated within a specific identifiable low-dimensional subspace that can be isolated and projected without losing subject identity or introducing new artifacts.
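
The premise is empirically checkable: if drift is subspace-concentrated, the cumulative explained variance of its principal components should saturate after a few dimensions, which is what the paper's Figures 3 and 4 report. A toy sketch with a planted rank-3 drift (synthetic data, not the paper's):

```python
import numpy as np

# If the premise holds on real drift vectors, `explained` saturates early.
rng = np.random.default_rng(1)
true_basis, _ = np.linalg.qr(rng.normal(size=(768, 3)))   # planted subspace
coeffs = rng.normal(size=(200, 3)) * np.array([5.0, 3.0, 2.0])
drifts = coeffs @ true_basis.T + 0.05 * rng.normal(size=(200, 768))

_, s, _ = np.linalg.svd(drifts - drifts.mean(0), full_matrices=False)
explained = np.cumsum(s**2) / np.sum(s**2)
print(explained[:10].round(3))   # first three entries already near 1.0
```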

What would settle it

Apply the subspace projection to personalized embeddings, then generate images from complex contextual prompts; if the adjusted embeddings yield no measurable gain over the unadjusted ones in how accurately the outputs reflect all elements of the prompt, the core claim fails.
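
The paper says only that its metrics are "CLIP-based", so this scoring sketch is a hedged assumption: the checkpoint, the function name, and the CLIPScore-style rescaling of Hessel et al. are ours, not the paper's.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical scorer for the settling test: average this score over complex
# contextual prompts for images generated with vs. without the adjustment.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prompt_fidelity(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img, txt).item()
    return 2.5 * max(cos, 0.0)  # CLIPScore-style rescaling (Hessel et al.)

# e.g. prompt_fidelity(generated, "a sks dog wearing a red hat on the beach")
```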

Figures

Figures reproduced from arXiv: 2605.07257 by Amardeep Kaur, Anh Tuan Bui, Dinh Phung, Junae Kim, Rollin Omari, Tamas Abraham, Thuy-Trang Vu, Van-Anh Nguyen.

Figure 1: Example of SCP happening either with or without fine-tuning the embedding …
Figure 2: Comparing output of DreamBooth with and without AdaptSP …
Figure 3: Analysis on the CelebA dataset. (a) Cumulative explained variance by the first 10 principal …
Figure 4: Analysis on the CC101 dataset. (a) Cumulative explained variance by the first 10 principal …
Figure 5: Impact of the number of principal components (PCs) on scores.
Figure 6: Problem of subject fidelity metrics that can assign artificially high subject-fidelity scores …
Figure 7: DreamBooth on CelebA (concept 342): Qualitative comparison of DreamBooth and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds for each method.
Figure 8: DreamBooth on CelebA (concept 908): Qualitative comparison of DreamBooth and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds for each method.
Figure 9: DreamBooth on CelebA (concept 181): Qualitative comparison of DreamBooth and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds.
Figure 10: DreamBooth on CC101 (concept cat): Qualitative comparison of DreamBooth and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds.
Figure 11: DreamBooth on CC101 (concept teddy bear): Qualitative comparison of DreamBooth and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds.
Figure 12: DreamBooth on CC101 (concept table): Qualitative comparison of DreamBooth and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds.
Figure 13: Custom Diffusion on CelebA (concept 342): Qualitative comparison of Custom Diffusion and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds.
Figure 14: Custom Diffusion on CelebA (concept 908): Qualitative comparison of Custom Diffusion and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds.
Figure 15: Custom Diffusion on CelebA (concept 181): Qualitative comparison of Custom Diffusion and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds.
Figure 16: Custom Diffusion on CC101 (concept cat): Qualitative comparison of Custom Diffusion and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds.
Figure 17: Custom Diffusion on CC101 (concept teddy bear): Qualitative comparison of Custom Diffusion and AdaptSP variants. The first row shows the training images; the remaining rows show generations from different prompts and random seeds.
Figure 18: Custom Diffusion on CC101 (concept table): Qualitative comparison of Custom Diffusion and AdaptSP variants. The first row shows the reference images; the remaining rows show generations from different prompts and random seeds.
Original abstract

Generative personalization often suffers from the semantic collapsing problem (SCP), where a learned personalized concept overpowers the rest of the text prompt, causing the model to ignore important contextual details. To address this, we first analyze the underlying cause, revealing that the semantic drift responsible for SCP is not random but is concentrated within a specific low-dimensional subspace. We also discover that the personalization process perturbs the embedding of the original base concept, making it an unstable reference point. Based on these insights, we introduce Test-time Embedding Adjustment with Adaptive Subspace Projection (AdaptSP), a training-free method that uses the stable, pre-trained embedding as an anchor. AdaptSP isolates the semantic drift and projects it onto the identified subspace, performing a precise adjustment that mitigates SCP while maintaining the subject identity. Our experiments show that this targeted approach significantly improves prompt fidelity and contextual alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper claims that the semantic collapsing problem (SCP) in generative personalization arises because semantic drift is not random but concentrated in a specific low-dimensional subspace, and because personalization perturbs the base concept embedding into an unstable reference. It introduces AdaptSP, a training-free test-time method that anchors to the stable pre-trained embedding, adaptively identifies the drift subspace, and performs a projection adjustment to mitigate SCP while preserving subject identity, with claimed experimental gains in prompt fidelity and contextual alignment.

Significance. If the core empirical pattern and projection mechanism hold under rigorous validation, the result would be significant as a lightweight, training-free intervention that directly targets a common failure mode in personalized text-to-image models. The structured-subspace insight, if mathematically characterized, could inform embedding-space analysis more broadly in diffusion models.

major comments (3)
  1. [Abstract] Abstract and analysis section: the claim that semantic drift is 'concentrated within a specific low-dimensional subspace' is presented as the result of analysis, yet no equation, covariance construction, basis-vector definition, or invariance argument is supplied to show why the subspace is low-dimensional, consistent across subjects, or separable from identity features.
  2. [Method] Method description (AdaptSP): the projection step is described only at the level of 'isolates the semantic drift and projects it onto the identified subspace,' without a mathematical definition of the drift vector (e.g., difference between personalized and pre-trained embeddings), the adaptive subspace basis, or the orthogonal projection operator, leaving open whether identity information is inadvertently removed (one plausible reconstruction of these objects is sketched after this list).
  3. [Experiments] Experiments: the abstract asserts that AdaptSP 'significantly improves prompt fidelity and contextual alignment,' but no quantitative metrics, baselines (e.g., DreamBooth or Textual Inversion), ablation studies, or error analysis are referenced, rendering the performance claims unverifiable.
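
For concreteness, one plausible formalization of the objects comment 2 flags as missing; this is a reconstruction consistent with the abstract, not the paper's own notation:

```latex
% Plausible reconstruction (our notation, not the paper's): e_0 is the
% pre-trained base-concept embedding, e_p the personalized embedding.
\[
  d \;=\; e_p - e_0, \qquad
  C \;=\; \frac{1}{N}\sum_{i=1}^{N} d_i d_i^{\top}, \qquad
  U \;=\; [\,u_1 \;\cdots\; u_k\,],
\]
% with u_1, ..., u_k the leading eigenvectors of C. The orthogonal
% projector and adjusted embedding are then
\[
  P \;=\; U U^{\top}, \qquad
  e_{\mathrm{adj}} \;=\; e_p \;-\; \lambda\, P\, d,
\]
% which leaves the orthogonal complement (I - P) e_p, and hence identity
% directions outside the drift subspace, unchanged.
```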

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and the recommendation for major revision. The comments highlight areas where mathematical formalization and experimental referencing can be strengthened, and we will incorporate these improvements in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and analysis section: the claim that semantic drift is 'concentrated within a specific low-dimensional subspace' is presented as the result of analysis, yet no equation, covariance construction, basis-vector definition, or invariance argument is supplied to show why the subspace is low-dimensional, consistent across subjects, or separable from identity features.

    Authors: We agree that the supporting mathematical details were not made explicit. In the revision we will expand the analysis section with the definition of the drift vector as the difference between personalized and pre-trained embeddings, the covariance matrix constructed from drift vectors across multiple subjects, the basis vectors obtained as the leading eigenvectors of that matrix, and an invariance argument based on the stability of the dominant eigenspace under changes in personalization strength. This will also demonstrate separability by showing that identity-related directions remain orthogonal to the identified drift subspace. revision: yes

  2. Referee: [Method] Method description (AdaptSP): the projection step is described only at the level of 'isolates the semantic drift and projects it onto the identified subspace,' without a mathematical definition of the drift vector (e.g., difference between personalized and pre-trained embeddings), the adaptive subspace basis, or the orthogonal projection operator, leaving open whether identity information is inadvertently removed.

    Authors: We acknowledge the need for a precise formulation. The revised method section will explicitly define the drift vector, describe the adaptive construction of the subspace basis (via principal components of observed drift vectors with variance-based rank selection), and state the orthogonal projection operator applied to adjust the embedding. We will add a short argument that identity is preserved because the adjustment operates only within the drift subspace while leaving the orthogonal complement unchanged, supported by similarity measurements before and after projection (this check, together with the eigenspace-stability check from response 1, is sketched after this exchange). revision: yes

  3. Referee: [Experiments] Experiments: the abstract asserts that AdaptSP 'significantly improves prompt fidelity and contextual alignment,' but no quantitative metrics, baselines (e.g., DreamBooth or Textual Inversion), ablation studies, or error analysis are referenced, rendering the performance claims unverifiable.

    Authors: We agree that the abstract should reference the supporting evidence. Our experiments section reports quantitative results using CLIP-based prompt fidelity and contextual alignment scores, direct comparisons against DreamBooth and Textual Inversion, ablation studies on subspace dimension and projection strength, and error analysis on failure cases. In the revision we will update the abstract to cite these metrics, baselines, and studies so that the performance claims are directly traceable to the reported results. revision: yes
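
A minimal sketch of the two checks promised in responses 1 and 2, run on synthetic data with a planted drift subspace; every name, the toy data, and the use of principal angles are our assumptions:

```python
import numpy as np
from scipy.linalg import subspace_angles

# (a) Is the dominant drift eigenspace stable across personalization
#     strengths?  (b) Does the projection leave the orthogonal complement,
#     where identity directions would live, unchanged?
rng = np.random.default_rng(2)
basis, _ = np.linalg.qr(rng.normal(size=(768, 2)))    # planted drift subspace

def drifts(strength, n=100):
    return strength * rng.normal(size=(n, 2)) @ basis.T \
           + 0.02 * rng.normal(size=(n, 768))

def top_pcs(D, k=2):
    _, _, vt = np.linalg.svd(D - D.mean(0), full_matrices=False)
    return vt[:k].T                                   # orthonormal columns

U_weak, U_strong = top_pcs(drifts(1.0)), top_pcs(drifts(4.0))
print(np.rad2deg(subspace_angles(U_weak, U_strong)))  # (a) angles near zero

P = U_strong @ U_strong.T                             # orthogonal projector
e = rng.normal(size=768)
adjusted = e - 0.8 * (P @ e)                          # partial in-subspace fix
comp = lambda v: v - P @ v                            # complement component
print(np.allclose(comp(adjusted), comp(e)))           # (b) True
```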

Circularity Check

0 steps flagged

No significant circularity; the derivation rests on independent empirical analysis of the drift subspace.

Full rationale

The paper first performs an analysis to reveal that semantic drift is concentrated in a low-dimensional subspace and that personalization perturbs the base embedding. It then introduces AdaptSP as a training-free test-time projection method that uses the pre-trained embedding as an external anchor and projects the identified drift. No equations, parameter fits, or self-citations are shown that would make the claimed mitigation equivalent to the input observations by construction. The subspace identification is presented as a data-driven discovery rather than a definitional or fitted tautology, and the adjustment step operates on quantities treated as independently observable. The derivation chain therefore remains checkable against external benchmarks rather than closing on itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that drift is subspace-concentrated and that pre-trained embeddings are stable anchors. No explicit free parameters, axioms, or invented entities are stated in the abstract; the subspace itself is discovered rather than postulated a priori.

pith-pipeline@v0.9.0 · 5469 in / 1129 out tokens · 21156 ms · 2026-05-11T01:25:38.501332+00:00 · methodology

