pith. sign in

arxiv: 2603.07561 · v2 · pith:72CVBSOMnew · submitted 2026-03-08 · 💻 cs.CV

PureCC: Pure Learning for Text-to-Image Concept Customization

Pith reviewed 2026-05-21 11:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generationconcept customizationpersonalized synthesismodel preservationdecoupled learningdiffusion modelsadaptive guidance
0
0 comments X

The pith

PureCC decouples concept guidance from original predictions to keep the base text-to-image model intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PureCC tackles the common problem that customizing text-to-image models for new concepts often damages how the original model handles other prompts and tasks. It introduces a decoupled learning objective that supplies implicit guidance for the target concept while still requiring the model to produce its standard conditional predictions. This separation is realized in a dual-branch pipeline: one frozen branch extracts clean representations of the new concept, and a trainable branch handles the original prediction flow. An adaptive guidance scale further tunes how strongly the new concept pulls the output. The result is high-fidelity customization paired with stronger retention of the base model's prior behavior and abilities.

Core claim

PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale λ* to dynamically adjust the guidance

What carries the argument

decoupled learning objective that combines implicit target-concept guidance with the original conditional prediction

If this is right

  • The base model retains its original behavior and capabilities after the customization process.
  • High-fidelity images of the new concept can be produced while performance on unrelated tasks stays close to the unmodified model.
  • The adaptive guidance scale provides explicit control over the trade-off between concept strength and model preservation.
  • The dual-branch design jointly optimizes for both customization fidelity and original-model retention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar decoupling of guidance and base prediction could be tested in other generative tasks such as text-to-video or audio synthesis.
  • The approach may reduce unintended side effects when multiple users apply successive customizations to the same base model.
  • The adaptive scale could serve as a general mechanism for balancing adaptation and stability in other fine-tuning settings.

Load-bearing premise

The decoupled learning objective that pairs implicit guidance for the target concept with the original conditional prediction lets the model focus on the base model during training without losing customization quality.

What would settle it

Evaluating the customized model on a benchmark of standard prompts unrelated to the new concept and finding that its outputs degrade in quality or fidelity compared with the unmodified base model or with other customization methods.

Figures

Figures reproduced from arXiv: 2603.07561 by Liang Pan, Long Zeng, Meng Wang, Pingfa Feng, Qingyu Li, Siyang Song, Weicheng Xie, Wenyu Qin, Xiaole Xian, Zhichao Liao.

Figure 1
Figure 1. Figure 1: We introduce PureCC, a novel concept customization approach. (a) PureCC effectively maintains target-unrelated image elements with original model’s behavior after the personalized concept insertion. (b) Existing methods such as DreamBooth [36] and LoRA [18] fail to follow the prompt ‘placed on a bright window’ during custom generation. (c) The declined curve indicates that existing methods compromise the o… view at source ↗
Figure 2
Figure 2. Figure 2: Original Distribution Drift. Visualization and KL Di￾vergence results demonstrated that existing methods, which adjust pre-trained models to align with the target distribution for learning personalized concepts, lead to distribution drift. sonalized concept. To decouple the target concept from the custom set, we first introduce a representation extractor that uses a pre-trained flow-based model as the back… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of our PureCC. (a). We first fine-tune a flow model on the custom set as representation extractor. (b). During the pure learning stage, the representation extractor remains frozen and provides the target concept representation, which is then controlled by our adaptive scale λ ⋆ to implicitly guide the trainable model. The trainable model is initialized from another pre-trained flow model and provi… view at source ↗
Figure 4
Figure 4. Figure 4: Motivation of Adaptive Scale λ ⋆ . A small λ can pre￾serve the original model’s behavior and capabilities but leads to a decrease in the fidelity of the target concept. Conversely, when λ is excessively large, the personalized concept dominates the learning objective, causing the final distribution to drift away from the original distribution. This results in a degradation of the model’s generative ability… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative Comparison with SOTAs including Tuning-based methods: DreamBooth [36], DreamBooth + EWC [39], Mix-of￾Show [13], CIFC [9], and Tuning-free methods: DreamO [31] UNO [46]. 5. Experiments 5.1. Experimental Setup Dataset. To ensure a fair qualitative evaluation with pre￾vious methods, we select 14 personalized concepts from the dataset proposed by DreamBooth [36]. Furthermore, to evaluate the adapta… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison in Multi-Concept Cus￾tomization with Mix-of-show [13], LoRA-S [53]. B-LoRA CIFC DreamO PureCC (Ours) Instance Style A [V1] dog playing in a blooming garden, vibrant flowers, in [V2] style. A dog playing in a blooming garden, vibrant flowers. [V2] style A [V1] dog playing in a blooming garden, vibrant flowers, in [V3] style. [V1] dog [V3] style [V1] dog Original [PITH_FULL_IMAGE:figu… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison of style–instance customiza￾tion across different methods, including CIFC [9], B-LoRA [11], DreamO [31]. B-LoRA is a tuning-based approach specifically designed for balancing style and content adaptation. Each case combines an instance concept with a specific style. customization tasks. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of Pure Learning Process. xb P ureCC 0 denotes the images obtained by integrating the velocity field {v P ureCC t } T t=1. Similarly, xb original 0 and xb complete 0 are based on {v original t } T t=1 and {v complete t } T t=1, respectively. “iter” denotes the training iteration [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visualization of the Ablation Study [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Some samples selected from the dataset proposed by DreamBooth [36] [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Some samples additionally collected by us [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Some style samples in our DreamBenchPCC. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: More qualitative evaluation results 14 [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: More qualitative evaluation results 15 [PITH_FULL_IMAGE:figures/full_fig_p015_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Qualitative Analysis of the η [PITH_FULL_IMAGE:figures/full_fig_p017_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Results compared to which Original Conditional Prediction is provided by the Trainable model v θ2 t (xt|ybase) or another Frozen pre-trained model v θ3 t (xt|ybase) . original model’s outputs, disregarding the insertion of per￾sonalized concepts. For “Base Text Alignment,” users were tasked with choosing which of the two images more accu￾rately align with the base text’s description. For “Aesthetic Prefer… view at source ↗
Figure 17
Figure 17. Figure 17: The investigation page in user study [PITH_FULL_IMAGE:figures/full_fig_p018_17.png] view at source ↗
read the original abstract

Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization. However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts. To address this issue, we propose PureCC. PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale $\lambda^\star$ to dynamically adjust the guidance strength of the target concept, balancing customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization. The code is available at https://github.com/lzc-sg/PureCC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PureCC for text-to-image concept customization. It introduces a decoupled learning objective that combines implicit guidance of the target concept with the original conditional prediction, implemented via a dual-branch pipeline (frozen extractor for purified concept representations as implicit guidance; trainable flow model for original conditional prediction). An adaptive guidance scale λ* is added to balance fidelity and preservation. The central claim is that this yields state-of-the-art performance in preserving the original model's behavior and capabilities alongside high-fidelity customization, backed by extensive experiments and publicly released code.

Significance. If the preservation results hold under broad evaluation, the work is significant for addressing the common side-effect of concept customization degrading base-model capabilities, a practical barrier to deployment. The decoupled objective and dual-branch design with frozen extractor constitute a distinct technical contribution over prior personalization methods. Public code release supports reproducibility.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (method): the central preservation claim rests on the assertion that the decoupled objective plus frozen extractor isolates original conditional prediction without hidden leakage of concept features into the trainable flow model. No formal argument, leakage analysis, or ablation isolating this separation is supplied; the adaptive λ* may compensate for rather than eliminate entanglement, directly undermining the 'pure learning' guarantee.
  2. [§4] §4 (experiments): the SOTA preservation claim is asserted without reported quantitative metrics, baselines, or test-prompt coverage in the abstract and without evidence that preservation holds beyond limited prompts; this is load-bearing because the skeptic concern is precisely whether reported preservation reflects broad capability retention or narrow evaluation.
  3. [§3.3] §3.3 (adaptive guidance): λ* is described as dynamically adjusting guidance strength, yet the text indicates post-hoc tuning without robustness checks across concepts or prompt distributions; this weakens the claim that the method achieves stable preservation without manual intervention.
minor comments (2)
  1. [§3.2] Clarify whether the flow model remains fully frozen during inference or only during the original-prediction branch of training.
  2. [§4] Add explicit comparison table against recent preservation-aware baselines (e.g., those using regularization or orthogonal losses) rather than only fidelity-focused methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, with clear indications of planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method): the central preservation claim rests on the assertion that the decoupled objective plus frozen extractor isolates original conditional prediction without hidden leakage of concept features into the trainable flow model. No formal argument, leakage analysis, or ablation isolating this separation is supplied; the adaptive λ* may compensate for rather than eliminate entanglement, directly undermining the 'pure learning' guarantee.

    Authors: We acknowledge that the current manuscript lacks a formal mathematical argument or explicit leakage analysis for the isolation property. The dual-branch design with a frozen extractor is intended to enforce separation by preventing gradient flow of concept-specific signals into the trainable flow model parameters. To directly address this, we will add a new subsection in §3 with a gradient-flow diagram, an ablation comparing frozen versus trainable extractor variants, and quantitative leakage measurements (e.g., feature similarity before and after training). We will also clarify that λ* is not intended as a compensatory mechanism but as a fidelity-preservation balancer, with supporting analysis. revision: yes

  2. Referee: [§4] §4 (experiments): the SOTA preservation claim is asserted without reported quantitative metrics, baselines, or test-prompt coverage in the abstract and without evidence that preservation holds beyond limited prompts; this is load-bearing because the skeptic concern is precisely whether reported preservation reflects broad capability retention or narrow evaluation.

    Authors: We agree that the abstract should explicitly reference the quantitative preservation results and that broader test-prompt coverage should be emphasized. Section 4 already reports quantitative metrics (including CLIP-based preservation scores and capability retention measures) against several baselines on a collection of prompts spanning multiple categories. In the revision we will (i) update the abstract to include key quantitative preservation numbers, (ii) expand the description of the test-prompt set (currently >150 prompts across 8 categories), and (iii) add further baseline comparisons to strengthen the claim of broad capability retention. revision: yes

  3. Referee: [§3.3] §3.3 (adaptive guidance): λ* is described as dynamically adjusting guidance strength, yet the text indicates post-hoc tuning without robustness checks across concepts or prompt distributions; this weakens the claim that the method achieves stable preservation without manual intervention.

    Authors: We clarify that λ* is computed on-the-fly during training from the relative strength of the concept representation rather than via post-hoc manual selection. We nevertheless recognize the absence of systematic robustness verification. In the revised manuscript we will add a dedicated robustness study in §3.3 and §4 that evaluates λ* stability across 20+ concepts and varied prompt distributions, including sensitivity plots and failure-case analysis. revision: partial

Circularity Check

0 steps flagged

No circularity: PureCC's decoupled objective is a novel proposal evaluated empirically

full rationale

The paper introduces a new decoupled learning objective that combines implicit target concept guidance with original conditional prediction, implemented through a dual-branch pipeline featuring a frozen extractor and trainable flow model, plus an adaptive λ* scale. This is framed as an original methodological contribution rather than a re-expression or reduction of prior results. No self-citations, uniqueness theorems, or fitted parameters are shown to load-bear the central preservation claim; instead, SOTA performance is asserted via extensive experiments. The derivation chain does not reduce by construction to its inputs, and the method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that target concept representations can be purified independently via a frozen extractor and that the original conditional prediction can be maintained separately during training; no free parameters beyond the adaptive scale are detailed in the abstract.

free parameters (1)
  • adaptive guidance scale λ*
    Dynamically adjusted parameter to balance customization fidelity and model preservation, introduced as novel but without specified fitting procedure.
axioms (1)
  • domain assumption The original conditional prediction can be separated from target concept guidance in the learning objective for pure learning.
    Invoked as the basis for the decoupled objective and dual-branch pipeline.

pith-pipeline@v0.9.0 · 5745 in / 1142 out tokens · 32692 ms · 2026-05-21T11:04:58.433836+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 8 internal anchors

  1. [1]

    Ditctrl: Exploring attention control in multi-modal dif- fusion transformer for tuning-free multi-prompt longer video generation

    Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, and Xiangyu Yue. Ditctrl: Exploring attention control in multi-modal dif- fusion transformer for tuning-free multi-prompt longer video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7763–7772, 2025. 3

  2. [2]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 6, 12

  3. [3]

    Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023

    Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023. 3

  4. [4]

    Omniinsert: Mask-free video insertion of any reference via diffusion transformer models,

    Jinshu Chen, Xinghui Li, Xu Bai, Tianxiang Ma, Pengze Zhang, Zhuowei Chen, Gen Li, Lijie Liu, Songtao Zhao, Bingchuan Li, et al. Omniinsert: Mask-free video inser- tion of any reference via diffusion transformer models.arXiv preprint arXiv:2509.17627, 2025. 2

  5. [5]

    Chen, B., Martí Monsó, D., Du, Y ., Simchowitz, M., Tedrake, R., and Sitzmann, V

    Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. Cfg++: Manifold-constrained clas- sifier free guidance for diffusion models.arXiv preprint arXiv:2406.08070, 2024. 3

  6. [6]

    Diversity-rewarded cfg distillation.arXiv preprint arXiv:2410.06084, 2024

    Geoffrey Cideron, Andrea Agostinelli, Johan Ferret, Sertan Girgin, Romuald Elie, Olivier Bachem, Sarah Perrin, and Alexandre Ram´e. Diversity-rewarded cfg distillation.arXiv preprint arXiv:2410.06084, 2024. 3

  7. [7]

    Lo- rashop: Training-free multi-concept image generation and editing with rectified flow transformers.arXiv preprint arXiv:2505.23758, 2025

    Yusuf Dalva, Hidir Yesiltepe, and Pinar Yanardag. Lo- rashop: Training-free multi-concept image generation and editing with rectified flow transformers.arXiv preprint arXiv:2505.23758, 2025. 3

  8. [8]

    Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021. 3

  9. [9]

    How to continually adapt text-to-image diffusion models for flexible customization?Advances in Neural In- formation Processing Systems, 37:130057–130083, 2024

    Jiahua Dong, Wenqi Liang, Hongliu Li, Duzhen Zhang, Meng Cao, Henghui Ding, Salman H Khan, and Fahad Shah- baz Khan. How to continually adapt text-to-image diffusion models for flexible customization?Advances in Neural In- formation Processing Systems, 37:130057–130083, 2024. 3, 6, 7

  10. [10]

    Scaling recti- fied flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learn- ing, 2024. 2, 3, 6, 12

  11. [11]

    Implicit style-content separation using b-lora

    Yarden Frenkel, Yael Vinker, Ariel Shamir, and Daniel Cohen-Or. Implicit style-content separation using b-lora. In European Conference on Computer Vision, pages 181–198. Springer, 2024. 2, 6, 7, 12

  12. [12]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash- nik, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. An image is worth one word: Personalizing text-to- image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022. 3

  13. [13]

    Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Sys- tems, 36:15890–15902, 2023

    Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yun- peng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Sys- tems, 36:15890–15902, 2023. 2, 3, 6, 7, 12

  14. [14]

    Pulid: Pure and lightning id customization via contrastive alignment

    Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng Zhang, and Qian He. Pulid: Pure and lightning id customization via contrastive alignment.arXiv preprint arXiv:2404.16022, 2024. 3

  15. [15]

    Aid: Attention interpolation of text-to-image diffusion.arXiv preprint arXiv:2403.17924, 2024

    Qiyuan He, Jinghao Wang, Ziwei Liu, and Angela Yao. Aid: Attention interpolation of text-to-image diffusion.arXiv preprint arXiv:2403.17924, 2024. 3

  16. [16]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 3

  17. [17]

    Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 2

  18. [18]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021. 1, 2, 3, 4

  19. [19]

    Classdiffu- sion: More aligned personalization tuning with explicit class guidance.arXiv preprint arXiv:2405.17532, 2024

    Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Humphrey Shi, and Yunchao Wei. Classdiffu- sion: More aligned personalization tuning with explicit class guidance.arXiv preprint arXiv:2405.17532, 2024. 3

  20. [20]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 6, 12

  21. [21]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation

  22. [22]

    Multi-concept customization of text-to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023. 1, 3

  23. [23]

    Any- dressing: Customizable multi-garment virtual dressing via latent diffusion models

    Xinghui Li, Qichao Sun, Pengze Zhang, Fulong Ye, Zhichao Liao, Wanquan Feng, Songtao Zhao, and Qian He. Any- dressing: Customizable multi-garment virtual dressing via latent diffusion models. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23723–23733. IEEE, 2025. 2 9

  24. [24]

    Freehand sketch generation from mechanical components

    Zhichao Liao, Fengyuan Piao, Di Huang, Xinghui Li, Yue Ma, Pingfa Feng, Heming Fang, and Long Zeng. Freehand sketch generation from mechanical components. InProceed- ings of the 32nd ACM International Conference on Multime- dia, pages 6755–6764, 2024. 3

  25. [25]

    Humanaesexpert: Advancing a multi-modality foundation model for human image aesthetic assessment.arXiv preprint arXiv:2503.23907, 2025

    Zhichao Liao, Xiaokun Liu, Wenyu Qin, Qingyu Li, Qiulin Wang, Pengfei Wan, Di Zhang, Long Zeng, and Pingfa Feng. Humanaesexpert: Advancing a multi-modality foundation model for human image aesthetic assessment.arXiv preprint arXiv:2503.23907, 2025. 6, 16

  26. [26]

    Dreamfit: Garment-centric human generation via a lightweight anything-dressing en- coder

    Ente Lin, Xujie Zhang, Fuwei Zhao, Yuxuan Luo, Xin Dong, Long Zeng, and Xiaodan Liang. Dreamfit: Garment-centric human generation via a lightweight anything-dressing en- coder. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5218–5226, 2025. 2

  27. [27]

    Lora dropout as a spar- sity regularizer for overfitting control.arXiv preprint arXiv:2404.09610, 2024

    Yang Lin, Xinyu Ma, Xu Chu, Yujie Jin, Zhibang Yang, Yasha Wang, and Hong Mei. Lora dropout as a spar- sity regularizer for overfitting control.arXiv preprint arXiv:2404.09610, 2024. 3

  28. [28]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 2

  29. [29]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 3

  30. [30]

    Subject- diffusion: Open domain personalized text-to-image genera- tion without test-time fine-tuning

    Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject- diffusion: Open domain personalized text-to-image genera- tion without test-time fine-tuning. InACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024. 3

  31. [31]

    Dreamo: A unified framework for image customization,

    Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization.arXiv preprint arXiv:2504.16915,

  32. [32]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

  33. [33]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 3, 6, 12

  34. [34]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020. 3

  35. [35]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2

  36. [36]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500– 22510, 2023. 1, 2, 3, 6, 7, 12, 13

  37. [37]

    Eliminating oversaturation and artifacts of high guid- ance scales in diffusion models

    Seyedmorteza Sadat, Otmar Hilliges, and Romann M We- ber. Eliminating oversaturation and artifacts of high guid- ance scales in diffusion models. InThe Thirteenth Interna- tional Conference on Learning Representations, 2024. 3

  38. [38]

    No training, no problem: Rethinking classifier-free guidance for diffusion models.arXiv preprint arXiv:2407.02687, 2024

    Seyedmorteza Sadat, Manuel Kansy, Otmar Hilliges, and Romann M Weber. No training, no problem: Rethinking classifier-free guidance for diffusion models.arXiv preprint arXiv:2407.02687, 2024. 3

  39. [39]

    Overcoming catastrophic forgetting with hard attention to the task

    Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. InInternational conference on machine learning, pages 4548–4557. PMLR, 2018. 6, 7

  40. [40]

    Loraclr: Contrastive adaptation for customization of diffusion models

    Enis Simsar, Thomas Hofmann, Federico Tombari, and Pinar Yanardag. Loraclr: Contrastive adaptation for customization of diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13189–13198,

  41. [41]

    Measuring style similarity in diffusion models.arXiv preprint arXiv:2404.01292, 2024

    Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shra- may Palta, Micah Goldblum, Jonas Geiping, Abhinav Shri- vastava, and Tom Goldstein. Measuring style similarity in diffusion models.arXiv preprint arXiv:2404.01292, 2024. 6, 12

  42. [42]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 2

  43. [43]

    Claude 3.5 sonnet.https://www

    Claude 3.5 Sonnet. Claude 3.5 sonnet.https://www. anthropic . com / news / claude - 3 - 5 - sonnet,

  44. [44]

    p+: Ex- tended textual conditioning in text-to-image generation,

    Andrey V oynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. p+: Extended textual conditioning in text-to- image generation.arXiv preprint arXiv:2303.09522, 2023. 3

  45. [45]

    Tokencompose: Text-to-image diffusion with token-level supervision

    Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, and Zhuowen Tu. Tokencompose: Text-to-image diffusion with token-level supervision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8553–8564, 2024. 3

  46. [46]

    Less-to-more generalization: Unlocking more controllability by in-context generation,

    Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, and Qian He. Less-to-more generalization: Unlocking more controllability by in-context generation.arXiv preprint arXiv:2504.02160, 2025. 6, 7, 12

  47. [47]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,

  48. [48]

    Spf-portrait: Towards pure text-to-portrait customiza- tion with semantic pollution-free fine-tuning.arXiv preprint arXiv:2504.00396, 2025

    Xiaole Xian, Zhichao Liao, Qingyu Li, Wenyu Qin, Pengfei Wan, Weicheng Xie, Long Zeng, Linlin Shen, and Pingfa Feng. Spf-portrait: Towards pure text-to-portrait customiza- tion with semantic pollution-free fine-tuning.arXiv preprint arXiv:2504.00396, 2025. 3

  49. [49]

    Fastcomposer: Tuning-free multi- subject image generation with localized attention.Interna- 10 tional Journal of Computer Vision, 133(3):1175–1194, 2025

    Guangxuan Xiao, Tianwei Yin, William T Freeman, Fr ´edo Durand, and Song Han. Fastcomposer: Tuning-free multi- subject image generation with localized attention.Interna- 10 tional Journal of Computer Vision, 133(3):1175–1194, 2025. 3

  50. [50]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models.arXiv preprint arXiv:2308.06721,

  51. [51]

    Ssr-encoder: Encoding selective subject representation for subject-driven generation

    Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8069–8078, 2024. 3

  52. [52]

    Enhancing semantic fidelity in text-to-image synthesis: Attention regulation in diffusion models

    Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, and Kenji Kawaguchi. Enhancing semantic fidelity in text-to-image synthesis: Attention regulation in diffusion models. In European Conference on Computer Vision, pages 70–86. Springer, 2024. 3

  53. [53]

    Multi-lora composition for image generation

    Ming Zhong, Yelong Shen, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, and Weizhu Chen. Multi-lora composition for image generation. arXiv preprint arXiv:2402.16843, 2024. 3, 6, 7, 12

  54. [54]

    Lumina-next: Making lumina-t2x stronger and faster with next-dit.Advances in Neural Infor- mation Processing Systems, 37:131278–131315, 2024

    Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit.Advances in Neural Infor- mation Processing Systems, 37:131278–131315, 2024. 3 11 PureCC: Pure Learning for Text-to-Image Concept Customization Supplementary Material

  55. [55]

    Some samples can be seen in Fig

    Dataset Details To ensure a fairQualitative Evaluationwith previous methods, we selected 14 personalized concepts from the dataset proposed by DreamBooth [36]. Some samples can be seen in Fig. 10. Furthermore, to assess the adaptabil- ity of our method across a wider range of scenarios, we additionally collected a batch of novel personalized con- cepts, w...

  56. [56]

    For each personalized concept, both the represen- tation extractorv θ1 t and the trainable modelv θ2 t are trained in 400 steps

    More Implementation Details We perform training on an NVIDIA A100 GPU with a batch size of 2. For each personalized concept, both the represen- tation extractorv θ1 t and the trainable modelv θ2 t are trained in 400 steps. All images are generated using the default inference setting of 28 timesteps

  57. [57]

    Evaluation Metrics Details Since we are working with a new task setting—Pure Con- cept Customization—we specifically applied representative metrics to suit our task setting for quantitative evaluation. Fidelity of the personalized concept.For instance-level concepts, we employCLIP-I (target)[33] andDINO[2] to evaluate the similarity between the target con...

  58. [58]

    Generate a same-style image

    Qualitative and Quantitative Evaluation Details. We performed qualitative evaluations on our personalized dataset, which has been expanded to include new instance and style concepts. This dataset comprises a total of 30 personalized concepts, as shown in Fig. 10 and Fig. 11. For a comprehensive quantitative evaluation, we utilized DreamBenchPCC, as shown ...

  59. [59]

    To clarify, as shown in Tab

    Computational Cost Compared with previous approaches, our method intro- duces an additional training stage and employs an additional model branch in the Pure Learning (Stage-2) phase, which inevitably increases training time and GPU memory usage. To clarify, as shown in Tab. 4, we emphasize that: 1) Al- though an extra training stage is required, complete...

  60. [60]

    As shown in the qualitative results in Fig

    Analysis of HyperparameterηinL P CC Since our Pure Concept Customization lossL P CC intro- duces a weighting parameterηto modulate the pure learn- ing lossL P ureCC , we further analyze the sensitivity ofη. As shown in the qualitative results in Fig. 15 and the quan- titative comparison in Tab. 6, an excessively largeηleads to over-injection of the target...

  61. [61]

    Analysis of the Original Conditional Pre- dictionv original t In the main paper, we employv original t =v θ2 t (xt|ybase), treating the output of the trainable model as the original conditional prediction. A more effective and intuitive strat- egy would be to usev original t =v θ3 t (xt|ybase), i.e., obtain the original conditional prediction directly fro...

  62. [62]

    To demonstrate the effectiveness of our proposed novel learning objective, we designed an ablation experi- ment forL P ureCC

    Ablation Study Details We provide a detailed explanation of our ablation study here. To demonstrate the effectiveness of our proposed novel learning objective, we designed an ablation experi- ment forL P ureCC . This experiment involves fine-tuning using only the traditionalL CC for concept customiza- tion and comparing it with fine-tuning using our compl...

  63. [63]

    Original Behavior Consistency,

    User Study Besides qualitative and quantitative comparisons, to thor- oughly evaluate our method, we carried out a user study to determine whether our method is preferred by humans for pure concept customization. We engaged 42 partici- pants from diverse social backgrounds, with each test ses- sion lasting approximately 30 minutes. During the inves- tigat...