PureCC: Pure Learning for Text-to-Image Concept Customization
Pith reviewed 2026-05-21 11:04 UTC · model grok-4.3
The pith
PureCC decouples concept guidance from original predictions to keep the base text-to-image model intact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale λ* to dynamically adjust the guidance
What carries the argument
decoupled learning objective that combines implicit target-concept guidance with the original conditional prediction
If this is right
- The base model retains its original behavior and capabilities after the customization process.
- High-fidelity images of the new concept can be produced while performance on unrelated tasks stays close to the unmodified model.
- The adaptive guidance scale provides explicit control over the trade-off between concept strength and model preservation.
- The dual-branch design jointly optimizes for both customization fidelity and original-model retention.
Where Pith is reading between the lines
- Similar decoupling of guidance and base prediction could be tested in other generative tasks such as text-to-video or audio synthesis.
- The approach may reduce unintended side effects when multiple users apply successive customizations to the same base model.
- The adaptive scale could serve as a general mechanism for balancing adaptation and stability in other fine-tuning settings.
Load-bearing premise
The decoupled learning objective that pairs implicit guidance for the target concept with the original conditional prediction lets the model focus on the base model during training without losing customization quality.
What would settle it
Evaluating the customized model on a benchmark of standard prompts unrelated to the new concept and finding that its outputs degrade in quality or fidelity compared with the unmodified base model or with other customization methods.
Figures
read the original abstract
Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization. However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts. To address this issue, we propose PureCC. PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale $\lambda^\star$ to dynamically adjust the guidance strength of the target concept, balancing customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization. The code is available at https://github.com/lzc-sg/PureCC.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PureCC for text-to-image concept customization. It introduces a decoupled learning objective that combines implicit guidance of the target concept with the original conditional prediction, implemented via a dual-branch pipeline (frozen extractor for purified concept representations as implicit guidance; trainable flow model for original conditional prediction). An adaptive guidance scale λ* is added to balance fidelity and preservation. The central claim is that this yields state-of-the-art performance in preserving the original model's behavior and capabilities alongside high-fidelity customization, backed by extensive experiments and publicly released code.
Significance. If the preservation results hold under broad evaluation, the work is significant for addressing the common side-effect of concept customization degrading base-model capabilities, a practical barrier to deployment. The decoupled objective and dual-branch design with frozen extractor constitute a distinct technical contribution over prior personalization methods. Public code release supports reproducibility.
major comments (3)
- [Abstract and §3] Abstract and §3 (method): the central preservation claim rests on the assertion that the decoupled objective plus frozen extractor isolates original conditional prediction without hidden leakage of concept features into the trainable flow model. No formal argument, leakage analysis, or ablation isolating this separation is supplied; the adaptive λ* may compensate for rather than eliminate entanglement, directly undermining the 'pure learning' guarantee.
- [§4] §4 (experiments): the SOTA preservation claim is asserted without reported quantitative metrics, baselines, or test-prompt coverage in the abstract and without evidence that preservation holds beyond limited prompts; this is load-bearing because the skeptic concern is precisely whether reported preservation reflects broad capability retention or narrow evaluation.
- [§3.3] §3.3 (adaptive guidance): λ* is described as dynamically adjusting guidance strength, yet the text indicates post-hoc tuning without robustness checks across concepts or prompt distributions; this weakens the claim that the method achieves stable preservation without manual intervention.
minor comments (2)
- [§3.2] Clarify whether the flow model remains fully frozen during inference or only during the original-prediction branch of training.
- [§4] Add explicit comparison table against recent preservation-aware baselines (e.g., those using regularization or orthogonal losses) rather than only fidelity-focused methods.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below, with clear indications of planned revisions to the manuscript.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method): the central preservation claim rests on the assertion that the decoupled objective plus frozen extractor isolates original conditional prediction without hidden leakage of concept features into the trainable flow model. No formal argument, leakage analysis, or ablation isolating this separation is supplied; the adaptive λ* may compensate for rather than eliminate entanglement, directly undermining the 'pure learning' guarantee.
Authors: We acknowledge that the current manuscript lacks a formal mathematical argument or explicit leakage analysis for the isolation property. The dual-branch design with a frozen extractor is intended to enforce separation by preventing gradient flow of concept-specific signals into the trainable flow model parameters. To directly address this, we will add a new subsection in §3 with a gradient-flow diagram, an ablation comparing frozen versus trainable extractor variants, and quantitative leakage measurements (e.g., feature similarity before and after training). We will also clarify that λ* is not intended as a compensatory mechanism but as a fidelity-preservation balancer, with supporting analysis. revision: yes
-
Referee: [§4] §4 (experiments): the SOTA preservation claim is asserted without reported quantitative metrics, baselines, or test-prompt coverage in the abstract and without evidence that preservation holds beyond limited prompts; this is load-bearing because the skeptic concern is precisely whether reported preservation reflects broad capability retention or narrow evaluation.
Authors: We agree that the abstract should explicitly reference the quantitative preservation results and that broader test-prompt coverage should be emphasized. Section 4 already reports quantitative metrics (including CLIP-based preservation scores and capability retention measures) against several baselines on a collection of prompts spanning multiple categories. In the revision we will (i) update the abstract to include key quantitative preservation numbers, (ii) expand the description of the test-prompt set (currently >150 prompts across 8 categories), and (iii) add further baseline comparisons to strengthen the claim of broad capability retention. revision: yes
-
Referee: [§3.3] §3.3 (adaptive guidance): λ* is described as dynamically adjusting guidance strength, yet the text indicates post-hoc tuning without robustness checks across concepts or prompt distributions; this weakens the claim that the method achieves stable preservation without manual intervention.
Authors: We clarify that λ* is computed on-the-fly during training from the relative strength of the concept representation rather than via post-hoc manual selection. We nevertheless recognize the absence of systematic robustness verification. In the revised manuscript we will add a dedicated robustness study in §3.3 and §4 that evaluates λ* stability across 20+ concepts and varied prompt distributions, including sensitivity plots and failure-case analysis. revision: partial
Circularity Check
No circularity: PureCC's decoupled objective is a novel proposal evaluated empirically
full rationale
The paper introduces a new decoupled learning objective that combines implicit target concept guidance with original conditional prediction, implemented through a dual-branch pipeline featuring a frozen extractor and trainable flow model, plus an adaptive λ* scale. This is framed as an original methodological contribution rather than a re-expression or reduction of prior results. No self-citations, uniqueness theorems, or fitted parameters are shown to load-bear the central preservation claim; instead, SOTA performance is asserted via extensive experiments. The derivation chain does not reduce by construction to its inputs, and the method remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- adaptive guidance scale λ*
axioms (1)
- domain assumption The original conditional prediction can be separated from target concept guidance in the learning objective for pure learning.
Reference graph
Works this paper leans on
-
[1]
Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, and Xiangyu Yue. Ditctrl: Exploring attention control in multi-modal dif- fusion transformer for tuning-free multi-prompt longer video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7763–7772, 2025. 3
work page 2025
-
[2]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 6, 12
work page 2021
-
[3]
Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023. 3
work page 2023
-
[4]
Omniinsert: Mask-free video insertion of any reference via diffusion transformer models,
Jinshu Chen, Xinghui Li, Xu Bai, Tianxiang Ma, Pengze Zhang, Zhuowei Chen, Gen Li, Lijie Liu, Songtao Zhao, Bingchuan Li, et al. Omniinsert: Mask-free video inser- tion of any reference via diffusion transformer models.arXiv preprint arXiv:2509.17627, 2025. 2
-
[5]
Chen, B., Martí Monsó, D., Du, Y ., Simchowitz, M., Tedrake, R., and Sitzmann, V
Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. Cfg++: Manifold-constrained clas- sifier free guidance for diffusion models.arXiv preprint arXiv:2406.08070, 2024. 3
-
[6]
Diversity-rewarded cfg distillation.arXiv preprint arXiv:2410.06084, 2024
Geoffrey Cideron, Andrea Agostinelli, Johan Ferret, Sertan Girgin, Romuald Elie, Olivier Bachem, Sarah Perrin, and Alexandre Ram´e. Diversity-rewarded cfg distillation.arXiv preprint arXiv:2410.06084, 2024. 3
-
[7]
Yusuf Dalva, Hidir Yesiltepe, and Pinar Yanardag. Lo- rashop: Training-free multi-concept image generation and editing with rectified flow transformers.arXiv preprint arXiv:2505.23758, 2025. 3
-
[8]
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021. 3
work page 2021
-
[9]
Jiahua Dong, Wenqi Liang, Hongliu Li, Duzhen Zhang, Meng Cao, Henghui Ding, Salman H Khan, and Fahad Shah- baz Khan. How to continually adapt text-to-image diffusion models for flexible customization?Advances in Neural In- formation Processing Systems, 37:130057–130083, 2024. 3, 6, 7
work page 2024
-
[10]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learn- ing, 2024. 2, 3, 6, 12
work page 2024
-
[11]
Implicit style-content separation using b-lora
Yarden Frenkel, Yael Vinker, Ariel Shamir, and Daniel Cohen-Or. Implicit style-content separation using b-lora. In European Conference on Computer Vision, pages 181–198. Springer, 2024. 2, 6, 7, 12
work page 2024
-
[12]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash- nik, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. An image is worth one word: Personalizing text-to- image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[13]
Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yun- peng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Sys- tems, 36:15890–15902, 2023. 2, 3, 6, 7, 12
work page 2023
-
[14]
Pulid: Pure and lightning id customization via contrastive alignment
Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng Zhang, and Qian He. Pulid: Pure and lightning id customization via contrastive alignment.arXiv preprint arXiv:2404.16022, 2024. 3
-
[15]
Aid: Attention interpolation of text-to-image diffusion.arXiv preprint arXiv:2403.17924, 2024
Qiyuan He, Jinghao Wang, Ziwei Liu, and Angela Yao. Aid: Attention interpolation of text-to-image diffusion.arXiv preprint arXiv:2403.17924, 2024. 3
-
[16]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[17]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 2
work page 2020
-
[18]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021. 1, 2, 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[19]
Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Humphrey Shi, and Yunchao Wei. Classdiffu- sion: More aligned personalization tuning with explicit class guidance.arXiv preprint arXiv:2405.17532, 2024. 3
-
[20]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 6, 12
work page 2023
-
[21]
Pick-a-pic: An open dataset of user preferences for text-to-image generation
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation
-
[22]
Multi-concept customization of text-to-image diffusion
Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023. 1, 3
work page 1931
-
[23]
Any- dressing: Customizable multi-garment virtual dressing via latent diffusion models
Xinghui Li, Qichao Sun, Pengze Zhang, Fulong Ye, Zhichao Liao, Wanquan Feng, Songtao Zhao, and Qian He. Any- dressing: Customizable multi-garment virtual dressing via latent diffusion models. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23723–23733. IEEE, 2025. 2 9
work page 2025
-
[24]
Freehand sketch generation from mechanical components
Zhichao Liao, Fengyuan Piao, Di Huang, Xinghui Li, Yue Ma, Pingfa Feng, Heming Fang, and Long Zeng. Freehand sketch generation from mechanical components. InProceed- ings of the 32nd ACM International Conference on Multime- dia, pages 6755–6764, 2024. 3
work page 2024
-
[25]
Zhichao Liao, Xiaokun Liu, Wenyu Qin, Qingyu Li, Qiulin Wang, Pengfei Wan, Di Zhang, Long Zeng, and Pingfa Feng. Humanaesexpert: Advancing a multi-modality foundation model for human image aesthetic assessment.arXiv preprint arXiv:2503.23907, 2025. 6, 16
-
[26]
Dreamfit: Garment-centric human generation via a lightweight anything-dressing en- coder
Ente Lin, Xujie Zhang, Fuwei Zhao, Yuxuan Luo, Xin Dong, Long Zeng, and Xiaodan Liang. Dreamfit: Garment-centric human generation via a lightweight anything-dressing en- coder. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5218–5226, 2025. 2
work page 2025
-
[27]
Yang Lin, Xinyu Ma, Xu Chu, Yujie Jin, Zhibang Yang, Yasha Wang, and Hong Mei. Lora dropout as a spar- sity regularizer for overfitting control.arXiv preprint arXiv:2404.09610, 2024. 3
-
[28]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[29]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[30]
Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject- diffusion: Open domain personalized text-to-image genera- tion without test-time fine-tuning. InACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024. 3
work page 2024
-
[31]
Dreamo: A unified framework for image customization,
Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization.arXiv preprint arXiv:2504.16915,
-
[32]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,
-
[33]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 3, 6, 12
work page 2021
-
[34]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020. 3
work page 2020
-
[35]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2
work page 2022
-
[36]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500– 22510, 2023. 1, 2, 3, 6, 7, 12, 13
work page 2023
-
[37]
Eliminating oversaturation and artifacts of high guid- ance scales in diffusion models
Seyedmorteza Sadat, Otmar Hilliges, and Romann M We- ber. Eliminating oversaturation and artifacts of high guid- ance scales in diffusion models. InThe Thirteenth Interna- tional Conference on Learning Representations, 2024. 3
work page 2024
-
[38]
Seyedmorteza Sadat, Manuel Kansy, Otmar Hilliges, and Romann M Weber. No training, no problem: Rethinking classifier-free guidance for diffusion models.arXiv preprint arXiv:2407.02687, 2024. 3
-
[39]
Overcoming catastrophic forgetting with hard attention to the task
Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. InInternational conference on machine learning, pages 4548–4557. PMLR, 2018. 6, 7
work page 2018
-
[40]
Loraclr: Contrastive adaptation for customization of diffusion models
Enis Simsar, Thomas Hofmann, Federico Tombari, and Pinar Yanardag. Loraclr: Contrastive adaptation for customization of diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13189–13198,
-
[41]
Measuring style similarity in diffusion models.arXiv preprint arXiv:2404.01292, 2024
Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shra- may Palta, Micah Goldblum, Jonas Geiping, Abhinav Shri- vastava, and Tom Goldstein. Measuring style similarity in diffusion models.arXiv preprint arXiv:2404.01292, 2024. 6, 12
-
[42]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 2
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[43]
Claude 3.5 Sonnet. Claude 3.5 sonnet.https://www. anthropic . com / news / claude - 3 - 5 - sonnet,
-
[44]
p+: Ex- tended textual conditioning in text-to-image generation,
Andrey V oynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. p+: Extended textual conditioning in text-to- image generation.arXiv preprint arXiv:2303.09522, 2023. 3
-
[45]
Tokencompose: Text-to-image diffusion with token-level supervision
Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, and Zhuowen Tu. Tokencompose: Text-to-image diffusion with token-level supervision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8553–8564, 2024. 3
work page 2024
-
[46]
Less-to-more generalization: Unlocking more controllability by in-context generation,
Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, and Qian He. Less-to-more generalization: Unlocking more controllability by in-context generation.arXiv preprint arXiv:2504.02160, 2025. 6, 7, 12
-
[47]
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,
work page internal anchor Pith review Pith/arXiv arXiv
-
[48]
Xiaole Xian, Zhichao Liao, Qingyu Li, Wenyu Qin, Pengfei Wan, Weicheng Xie, Long Zeng, Linlin Shen, and Pingfa Feng. Spf-portrait: Towards pure text-to-portrait customiza- tion with semantic pollution-free fine-tuning.arXiv preprint arXiv:2504.00396, 2025. 3
-
[49]
Guangxuan Xiao, Tianwei Yin, William T Freeman, Fr ´edo Durand, and Song Han. Fastcomposer: Tuning-free multi- subject image generation with localized attention.Interna- 10 tional Journal of Computer Vision, 133(3):1175–1194, 2025. 3
work page 2025
-
[50]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models.arXiv preprint arXiv:2308.06721,
work page internal anchor Pith review Pith/arXiv arXiv
-
[51]
Ssr-encoder: Encoding selective subject representation for subject-driven generation
Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8069–8078, 2024. 3
work page 2024
-
[52]
Enhancing semantic fidelity in text-to-image synthesis: Attention regulation in diffusion models
Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, and Kenji Kawaguchi. Enhancing semantic fidelity in text-to-image synthesis: Attention regulation in diffusion models. In European Conference on Computer Vision, pages 70–86. Springer, 2024. 3
work page 2024
-
[53]
Multi-lora composition for image generation
Ming Zhong, Yelong Shen, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, and Weizhu Chen. Multi-lora composition for image generation. arXiv preprint arXiv:2402.16843, 2024. 3, 6, 7, 12
-
[54]
Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Xiangyang Zhu, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit.Advances in Neural Infor- mation Processing Systems, 37:131278–131315, 2024. 3 11 PureCC: Pure Learning for Text-to-Image Concept Customization Supplementary Material
work page 2024
-
[55]
Some samples can be seen in Fig
Dataset Details To ensure a fairQualitative Evaluationwith previous methods, we selected 14 personalized concepts from the dataset proposed by DreamBooth [36]. Some samples can be seen in Fig. 10. Furthermore, to assess the adaptabil- ity of our method across a wider range of scenarios, we additionally collected a batch of novel personalized con- cepts, w...
-
[56]
More Implementation Details We perform training on an NVIDIA A100 GPU with a batch size of 2. For each personalized concept, both the represen- tation extractorv θ1 t and the trainable modelv θ2 t are trained in 400 steps. All images are generated using the default inference setting of 28 timesteps
-
[57]
Evaluation Metrics Details Since we are working with a new task setting—Pure Con- cept Customization—we specifically applied representative metrics to suit our task setting for quantitative evaluation. Fidelity of the personalized concept.For instance-level concepts, we employCLIP-I (target)[33] andDINO[2] to evaluate the similarity between the target con...
-
[58]
Qualitative and Quantitative Evaluation Details. We performed qualitative evaluations on our personalized dataset, which has been expanded to include new instance and style concepts. This dataset comprises a total of 30 personalized concepts, as shown in Fig. 10 and Fig. 11. For a comprehensive quantitative evaluation, we utilized DreamBenchPCC, as shown ...
-
[59]
Computational Cost Compared with previous approaches, our method intro- duces an additional training stage and employs an additional model branch in the Pure Learning (Stage-2) phase, which inevitably increases training time and GPU memory usage. To clarify, as shown in Tab. 4, we emphasize that: 1) Al- though an extra training stage is required, complete...
-
[60]
As shown in the qualitative results in Fig
Analysis of HyperparameterηinL P CC Since our Pure Concept Customization lossL P CC intro- duces a weighting parameterηto modulate the pure learn- ing lossL P ureCC , we further analyze the sensitivity ofη. As shown in the qualitative results in Fig. 15 and the quan- titative comparison in Tab. 6, an excessively largeηleads to over-injection of the target...
-
[61]
Analysis of the Original Conditional Pre- dictionv original t In the main paper, we employv original t =v θ2 t (xt|ybase), treating the output of the trainable model as the original conditional prediction. A more effective and intuitive strat- egy would be to usev original t =v θ3 t (xt|ybase), i.e., obtain the original conditional prediction directly fro...
-
[62]
Ablation Study Details We provide a detailed explanation of our ablation study here. To demonstrate the effectiveness of our proposed novel learning objective, we designed an ablation experi- ment forL P ureCC . This experiment involves fine-tuning using only the traditionalL CC for concept customiza- tion and comparing it with fine-tuning using our compl...
-
[63]
Original Behavior Consistency,
User Study Besides qualitative and quantitative comparisons, to thor- oughly evaluate our method, we carried out a user study to determine whether our method is preferred by humans for pure concept customization. We engaged 42 partici- pants from diverse social backgrounds, with each test ses- sion lasting approximately 30 minutes. During the inves- tigat...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.