Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

Georgios Tsoumplekas; Stella Bounareli; Vasileios Argyriou

arxiv: 2606.03792 · v1 · pith:73BEWWBEnew · submitted 2026-06-02 · 💻 cs.CV · cs.LG

Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

Georgios Tsoumplekas , Stella Bounareli , Vasileios Argyriou This is my paper

Pith reviewed 2026-06-28 10:45 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords LoRA compositionmulti-concept generationtext-to-imageprompt-aware weightingdiffusion modelstraining-freepersonalizationconcept interference

0 comments

The pith

Prompt token importance enables training-free weighting of multiple LoRAs to reduce concept interference in image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address interference that arises when naively combining outputs from several LoRA modules adapted to different visual concepts. It does so by weighting each LoRA according to the semantic influence of its trigger words within the target prompt. Two concrete implementations, W-Switch and W-Composite, carry out this prompt-aware weighting without any additional training. The resulting composites show higher visual quality, better identity preservation, and stronger compositionality on the ComposLoRA testbed. The authors further support these gains with an image-based similarity metric that compares automatically segmented regions of generated outputs against real reference images.

Core claim

The central claim is that weighting the outputs of multiple LoRA modules by the semantic influence of their trigger words in the target prompt produces effective multi-concept combinations. This prompt-aware importance strategy, realized in the W-Switch and W-Composite methods, yields generated images with less concept interference, higher fidelity to individual references, and improved overall compositionality compared with unweighted or prior baselines.

What carries the argument

Prompt-aware importance weighting, which assigns each LoRA a weight derived from the semantic influence of its trigger words in the generation prompt.

If this is right

Generated images preserve individual concept identities more accurately than naive LoRA addition.
Quantitative gains appear consistently across the ComposLoRA benchmark and align with LLM-based and human evaluations.
The new image-based similarity framework enables direct fidelity checks between generated regions and real reference photos.
No retraining is required to compose already-adapted LoRA modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token-influence logic might extend to weighting other modular adapters such as ControlNets or textual inversions.
Prompts containing overlapping or ambiguous trigger words could expose cases where token counts alone under-estimate concept priority.
Applying the weighting at different diffusion timesteps rather than uniformly might further reduce early-stage interference.

Load-bearing premise

Prompt tokens alone can reliably indicate the relative semantic importance of each concept for weighting LoRA outputs.

What would settle it

A controlled test set of multi-concept prompts where the prompt-weighted LoRA outputs exhibit measurably higher concept mixing or lower reference similarity than equal-weight combinations.

Figures

Figures reproduced from arXiv: 2606.03792 by Georgios Tsoumplekas, Stella Bounareli, Vasileios Argyriou.

**Figure 1.** Figure 1: Prompt-based relative importance estimation via textencoder similarity. Given a target prompt p and per-LoRA triggerword sets Ki , we compute a relative importance score for each LoRA. (a) PAW removes LoRA-i trigger words Ki from the prompt to form p−i and scores importance by semantic change m PAW i = 1−cos(c,c−i). (b) PTW encodes Ki and scores alignment with the prompt m PTW i = cos(c,cki ). p−i = p \ … view at source ↗

**Figure 2.** Figure 2: Overview of the two prompt-aware multi-LoRA composition methods. W-Composite aggregates the LoRA-augmented noise predictions at every timestep using fixed, prompt-derived weights wi . W-Switch activates exactly one LoRA per timestep, following a cyclic schedule with within-cycle block lengths proportional to wi . mechanisms [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: IArcFace vs. the number of composed LoRAs N. The dashed line denotes the N = 1 upper bound. Identity alignment degrades only slightly as more LoRAs are composed. 4.3. Quantitative Results We first report quantitative results on the ComposLoRA testbed, following the evaluation protocol of [69] [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of multi-LoRA composition on the ComposLoRA testbed. Columns show Switch [68], Composite [68], CMLoRA [69] and the proposed W-Composite and W-Switch methods. Each row adds one additional LoRA. nal concepts as evidenced by its accurate preservation of finegrained clothing attributes such as the blue skirt and red tie (first and second rows). Finally, for N = 5, only Switch and W-Swi… view at source ↗

**Figure 5.** Figure 5: CLIP embedding space visualization for cropped em [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Example of concept-specific cropped regions extracted [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Overview of the proposed evaluation pipeline for [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Generated images for the three character LoRAs using [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Prompt used for the MiniCPM-based image quality evaluation. [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Instructions shown to the user study participants outlin [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Sample evaluation question used in the human prefer [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: Examples of failure cases for the two proposed methods. [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

**Figure 13.** Figure 13: Examples of generated images with N = 2 LoRA candidates across our proposed methods and baseline models in the ComposLoRA testbed [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗

**Figure 14.** Figure 14: Examples of generated images with N = 3 LoRA candidates across our proposed methods and baseline models in the ComposLoRA testbed [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗

**Figure 15.** Figure 15: Examples of generated images with N = 4 LoRA candidates across our proposed methods and baseline models in the ComposLoRA testbed [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗

**Figure 16.** Figure 16: Examples of generated images with N = 5 LoRA candidates across our proposed methods and baseline models in the ComposLoRA testbed [PITH_FULL_IMAGE:figures/full_fig_p025_16.png] view at source ↗

read the original abstract

Low-Rank Adaptation (LoRA) successfully enables personalization in text-to-image generation by adapting pre-trained diffusion models to specific visual concepts and styles. However, extending such models to multi-concept customization remains challenging. Naively combining multiple LoRA weights or their outputs often leads to interference among concepts, resulting in degraded visual quality and reduced fidelity to the reference images of individual concepts. This paper proposes a simple yet effective approach for multi-concept customization by optimally combining the outputs of multiple LoRA modules. We leverage the relative importance of each concept during generation, as inferred from its corresponding prompt tokens and introduce two methods, W-Switch and W-Composite, that employ a prompt-aware importance weighting strategy in which each LoRA is weighted according to the semantic influence of its trigger words in the target prompt. In addition, we extend existing quantitative evaluation metrics by proposing a new image-based similarity evaluation framework that assesses image fidelity and identity preservation through comparisons between real-world reference images and automatically segmented concept regions from generated images. We evaluate our approach on the ComposLoRA testbed and demonstrate consistent improvements over existing state-of-the-art methods in terms of visual quality, identity preservation and compositionality. Qualitative evaluations, including a Large Language Model (LLM) based assessment and a user study, further validate the effectiveness of the proposed methods and align with the newly introduced quantitative image-based metrics. Our code is available at https://github.com/GeorgeTsoumplekas/Prompt-Aware-Multi-LoRA-Composition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Two prompt-token weighting schemes for combining LoRAs plus a segmented-image fidelity metric; gains shown on ComposLoRA but the mechanism for handling internal interference stays empirical.

read the letter

The main things here are W-Switch and W-Composite, which weight each LoRA's output by how strongly its trigger words appear in the target prompt, plus a new evaluation that segments generated images and compares concept regions to real references. They also run the usual qualitative checks with an LLM judge and a user study, and they ship the code.

What works is the practical framing. Multi-concept LoRA mixing is a real pain point, and doing it without extra training or fine-tuning is useful. The new metric tries to move past CLIP-style scores by looking at actual pixel regions, which is a reasonable direction even if segmentation quality will vary. The reported improvements over prior methods on the ComposLoRA set are consistent enough to notice.

The soft spot is the core assumption. The approach treats prompt-token importance as a reliable proxy for how much each concept should influence the final image. In diffusion, though, concept mixing happens across attention layers and residual paths during the denoising steps, not just at the output. The paper does not show equations or ablations that demonstrate why scalar output weighting is enough to untangle that internal interference; the gains are presented as empirical fact. The new metric also depends on automatic segmentation, which could introduce its own errors when concepts overlap.

This is for people building or tuning text-to-image pipelines who need quick multi-concept results. A serious referee should see it because the methods are simple to implement, the code is public, and the evaluation includes human data, but the theoretical grounding is light and the claims rest on the testbed results.

Referee Report

2 major / 2 minor

Summary. The paper claims that naively combining multiple LoRA modules for multi-concept text-to-image customization leads to interference and degraded fidelity; it introduces two training-free methods (W-Switch and W-Composite) that weight each LoRA output by the semantic importance of its trigger tokens in the target prompt, plus a new image-based similarity metric that compares segmented concept regions in generated images against real reference images. Experiments on the ComposLoRA testbed report consistent gains over prior methods in visual quality, identity preservation, and compositionality, corroborated by LLM-based assessment and a user study; code is released.

Significance. If the central claim holds, the work supplies a lightweight, training-free recipe for multi-concept personalization that could see immediate use in existing diffusion pipelines. The prompt-aware weighting idea and the proposed image-based fidelity metric are potentially reusable beyond this setting. The public code release is a clear strength that supports reproducibility.

major comments (2)

[Abstract / Method description of W-Switch and W-Composite] The core assumption that scalar weighting of LoRA outputs (via token-derived importance) suffices to resolve concept interference is load-bearing for both W-Switch and W-Composite. The manuscript states that interference arises when “combining the outputs of multiple LoRA modules,” yet diffusion-model interference is known to occur inside the U-Net via shared attention and residual connections across denoising steps; output-level weighting alone does not modulate those internal paths. No derivation or ablation is supplied showing that prompt-token importance remains a faithful proxy once latents are mixed.
[Evaluation section (new metric)] The new image-based similarity framework is presented as an extension of existing metrics, but the manuscript supplies no equations, segmentation procedure, exclusion rules, or quantitative comparison tables that would allow verification that the reported gains are not artifacts of the segmentation step or reference-image selection.

minor comments (2)

[Abstract] The abstract asserts “consistent improvements” and alignment with LLM/user-study results but contains no numerical values, dataset sizes, or statistical tests; moving these numbers into the abstract would improve readability.
[Method] Notation for the importance weights (how token influence is computed and normalized) is introduced without an explicit equation or pseudocode block, making the difference between W-Switch and W-Composite hard to reconstruct from text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance, reproducibility via code release, and potential reusability of the prompt-aware weighting and metric ideas. We address each major comment below with concrete plans for revision.

read point-by-point responses

Referee: [Abstract / Method description of W-Switch and W-Composite] The core assumption that scalar weighting of LoRA outputs (via token-derived importance) suffices to resolve concept interference is load-bearing for both W-Switch and W-Composite. The manuscript states that interference arises when “combining the outputs of multiple LoRA modules,” yet diffusion-model interference is known to occur inside the U-Net via shared attention and residual connections across denoising steps; output-level weighting alone does not modulate those internal paths. No derivation or ablation is supplied showing that prompt-token importance remains a faithful proxy once latents are mixed.

Authors: We agree that interference in diffusion models occurs through internal U-Net mechanisms. Our output-level weighting is intended as a practical, training-free approximation that modulates each LoRA's contribution according to prompt-token semantics, which our experiments show reduces observable interference in the final images. While no closed-form derivation of the proxy is provided, the consistent gains on ComposLoRA across visual quality, identity, and compositionality metrics (including LLM and user study) support its utility. In revision we will add a dedicated discussion paragraph on the internal-vs-output distinction and an ablation varying the weighting granularity to better characterize the proxy's faithfulness. revision: partial
Referee: [Evaluation section (new metric)] The new image-based similarity framework is presented as an extension of existing metrics, but the manuscript supplies no equations, segmentation procedure, exclusion rules, or quantitative comparison tables that would allow verification that the reported gains are not artifacts of the segmentation step or reference-image selection.

Authors: We accept this criticism; the current presentation of the image-based similarity framework is insufficiently detailed. The revised manuscript will include the complete equations, a precise description of the segmentation procedure and exclusion rules, and additional quantitative tables comparing the metric against baselines and alternative segmenters to confirm that reported improvements are not segmentation artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: method is a heuristic weighting strategy without load-bearing derivations or self-referential reductions

full rationale

The abstract and description present W-Switch and W-Composite as prompt-aware weighting heuristics that assign scalar weights to LoRA outputs based on token-derived importance. No equations, parameter fits, uniqueness theorems, or self-citations are invoked as load-bearing steps that would reduce the claimed improvements to the inputs by construction. The new image-based similarity metrics are extensions for evaluation rather than derivations of the core method. The approach is self-contained as an empirical combination strategy without any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract mentions no free parameters, axioms, or invented entities; all fields left empty due to insufficient detail.

pith-pipeline@v0.9.1-grok · 5811 in / 1006 out tokens · 39293 ms · 2026-06-28T10:45:07.984133+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 18 canonical work pages · 3 internal anchors

[1]

In- trinsic dimensionality explains the effectiveness of language model fine-tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. In- trinsic dimensionality explains the effectiveness of language model fine-tuning. InProceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 7319–7328, 2021. 3

2021
[2]

Imagen 3.arXiv preprint arXiv:2408.07009, 2024

Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brich- tova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3.arXiv preprint arXiv:2408.07009, 2024. 1

work page arXiv 2024
[3]

Jedi: The force of jensen-shannon divergence in disentangling dif- fusion models.arXiv preprint arXiv:2505.19166, 2025

Eric Tillmann Bill, Enis Simsar, and Thomas Hofmann. Jedi: The force of jensen-shannon divergence in disentangling dif- fusion models.arXiv preprint arXiv:2505.19166, 2025. 2

work page arXiv 2025
[4]

Finding directions in gan’s latent space for neural face reenactment

Stella Bounareli, Vasileios Argyriou, and Georgios Tz- imiropoulos. Finding directions in gan’s latent space for neural face reenactment. InBritish Machine Vision Confer- ence, 2022. 14

2022
[5]

How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks)

Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). InProceedings of the IEEE international conference on computer vision, pages 1021–1030, 2017. 6, 14

2017
[6]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025. 6, 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Consislora: Enhancing content and style consistency for lora-based style transfer.arXiv preprint arXiv:2503.10614, 2025

Bolin Chen, Baoquan Zhao, Haoran Xie, Yi Cai, Qing Li, and Xudong Mao. Consislora: Enhancing content and style consistency for lora-based style transfer.arXiv preprint arXiv:2503.10614, 2025. 2

work page arXiv 2025
[8]

Iteris: Iterative inference-solving alignment for lora merging

Hongxu Chen, Zhen Wang, Runshi Li, Bowei Zhu, and Long Chen. Iteris: Iterative inference-solving alignment for lora merging. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 4829–4838, 2025. 1, 2

2025
[9]

Perception prior- itized training of diffusion models

Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prior- itized training of diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 11472–11481, 2022. 5, 9

2022
[10]

Lo- rashop: Training-free multi-concept image generation and editing with rectified flow transformers.arXiv preprint arXiv:2505.23758, 2025

Yusuf Dalva, Hidir Yesiltepe, and Pinar Yanardag. Lo- rashop: Training-free multi-concept image generation and editing with rectified flow transformers.arXiv preprint arXiv:2505.23758, 2025. 3

work page arXiv 2025
[11]

Arcface: Additive angular margin loss for deep face recog- nition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recog- nition. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 4690–4699, 2019. 2, 6, 17

2019
[12]

Compositional visual generation with energy based models.Advances in Neural Information Processing Systems, 33:6637–6647, 2020

Yilun Du, Shuang Li, and Igor Mordatch. Compositional visual generation with energy based models.Advances in Neural Information Processing Systems, 33:6637–6647, 2020. 2

2020
[13]

Sugarcrepe++ dataset: Vision-language model sensi- tivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018, 2024

Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, Evangelos Milios, Sageev Oore, and Hassan Sajjad. Sugarcrepe++ dataset: Vision-language model sensi- tivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018, 2024. 4

2024
[14]

Loratorio: An intrinsic approach to lora skill composition

Niki Foteinopoulou, Ignas Budvytis, and Stephan Liwicki. Loratorio: An intrinsic approach to lora skill composition. arXiv preprint arXiv:2508.11624, 2025. 1, 2, 3, 6, 7, 17, 20

work page arXiv 2025
[15]

Implicit style-content separation using b-lora

Yarden Frenkel, Yael Vinker, Ariel Shamir, and Daniel Cohen- Or. Implicit style-content separation using b-lora. InEu- ropean Conference on Computer Vision, pages 181–198. Springer, 2024. 2

2024
[16]

An image is worth one word: Personalizing text-to-image generation using textual inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. InThe Eleventh Interna- tional Conference on Learning Representations, 2023. 17

2023
[17]

Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36:15890–15902, 2023

Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yun- peng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36:15890–15902, 2023. 1, 2, 3, 6, 13

2023
[18]

Clipscore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021. 6, 17

2021
[19]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 3

2021
[20]

Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 3

2020
[21]

Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022. 1

2022
[22]

Interactdiffusion: Interaction con- trol in text-to-image diffusion models

Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap-Peng Tan, and Weipeng Hu. Interactdiffusion: Interaction con- trol in text-to-image diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6180–6189, 2024. 1

2024
[23]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3,

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3,
[24]

Lorahub: Efficient cross-task gener- alization via dynamic lora composition

Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. Lorahub: Efficient cross-task gener- alization via dynamic lora composition. InFirst Conference on Language Modeling, 2024. 2

2024
[25]

Eigenlorax: Recycling adapters to find principal sub- spaces for resource-efficient adaptation and inference

Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, and Alan Yuille. Eigenlorax: Recycling adapters to find principal sub- spaces for resource-efficient adaptation and inference. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 649–659, 2025. 2

2025
[26]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Multi-concept customization of text- to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shecht- man, and Jun-Yan Zhu. Multi-concept customization of text- to-image diffusion. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 1931–1941, 2023. 2

1931
[28]

Concept weaver: Enabling multi-concept fusion in text-to-image models

Gihyun Kwon, Simon Jenni, Dingzeyu Li, Joon-Young Lee, Jong Chul Ye, and Fabian Caba Heilbron. Concept weaver: Enabling multi-concept fusion in text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 8880–8889, 2024. 2

2024
[29]

The double-ellipsoid geometry of CLIP

Meir Yossef Levi and Guy Gilboa. The double-ellipsoid geometry of CLIP. InForty-second International Conference on Machine Learning, 2025. 6

2025
[30]

Autolora: Automatic lora retrieval and fine-grained gated fusion for text-to-image generation.arXiv preprint arXiv:2508.02107, 2025

Zhiwen Li, Zhongjie Duan, Die Chen, Cen Chen, Daoyuan Chen, Yaliang Li, and Yingda Chen. Autolora: Automatic lora retrieval and fine-grained gated fusion for text-to-image generation.arXiv preprint arXiv:2508.02107, 2025. 2

work page arXiv 2025
[31]

Compositional visual generation with composable diffusion models

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. InEuropean conference on computer vision, pages 423–439. Springer, 2022. 1

2022
[32]

Customizable image synthesis with multiple subjects

Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Customizable image synthesis with multiple subjects. Advances in neural information processing systems, 36:57500– 57519, 2023. 2, 3

2023
[33]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in neural information processing systems, 35:5775–5787, 2022

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in neural information processing systems, 35:5775–5787, 2022. 6

2022
[34]

Contrastive test-time composition of multiple lora models for image generation

Tuna Han Salih Meral, Enis Simsar, Federico Tombari, and Pinar Yanardag. Contrastive test-time composition of multiple lora models for image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18090–18100, 2025. 3

2025
[35]

Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on

Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Mar- cella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. In Proceedings of the 31st ACM international conference on multimedia, pages 8580–8589, 2023. 1

2023
[36]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patrick...
[37]

K-lora: Unlock- ing training-free fusion of any subject and style loras

Ziheng Ouyang, Zhen Li, and Qibin Hou. K-lora: Unlock- ing training-free fusion of any subject and style loras. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13041–13050, 2025. 2

2025
[38]

Understanding the latent space of diffu- sion models through the lens of riemannian geometry.Ad- vances in Neural Information Processing Systems, 36:24129– 24142, 2023

Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffu- sion models through the lens of riemannian geometry.Ad- vances in Neural Information Processing Systems, 36:24129– 24142, 2023. 5, 9

2023
[39]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,
[40]

Tara: Token-aware lora for composable personalization in diffusion models

Yuqi Peng, Lingtao Zheng, Yufeng Yang, Yi Huang, Mingfu Yan, Jianzhuang Liu, and Shifeng Chen. Tara: Token-aware lora for composable personalization in diffusion models. arXiv preprint arXiv:2508.08812, 2025. 2

work page arXiv 2025
[41]

Orthogonal adaptation for modular customization of diffusion models

Ryan Po, Guandao Yang, Kfir Aberman, and Gordon Wet- zstein. Orthogonal adaptation for modular customization of diffusion models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 7964–7973, 2024. 1, 2, 6, 13

2024
[42]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InThe Twelfth International Conference on Learning Representations, 2024. 1

2024
[43]

Lora soups: Merg- ing loras for practical skill composition tasks

Akshara Prabhakar, Yuanzhi Li, Karthik Narasimhan, Sham Kakade, Eran Malach, and Samy Jelassi. Lora soups: Merg- ing loras for practical skill composition tasks. InProceedings of the 31st International Conference on Computational Lin- guistics: Industry Track, pages 644–655, 2025. 2

2025
[44]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 6

2021
[45]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. InInternational confer- ence on machine learning, pages 8821–8831. Pmlr, 2021. 1

2021
[46]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 3, 6

2022
[47]

Multlfg: Training-free multi-lora composi- tion using frequency-domain guidance.arXiv preprint arXiv:2505.20525, 2025

Aniket Roy, Maitreya Suin, Ketul Shah, and Rama Chellappa. Multlfg: Training-free multi-lora composi- tion using frequency-domain guidance.arXiv preprint arXiv:2505.20525, 2025. 1, 2, 3

work page arXiv 2025
[48]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven gen- eration

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven gen- eration. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 22500–22510,
[49]

Pho- torealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Pho- torealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022. 1

2022
[50]

Ziplora: Any subject in any style by effectively merging loras

Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, and Varun Jampani. Ziplora: Any subject in any style by effectively merging loras. InEuropean Conference on Computer Vision, pages 422–438. Springer,
[51]

Donald Shenaj, Ondrej Bohdal, Mete Ozay, Pietro Zanuttigh, and Umberto Michieli. Lora. rar: Learning to merge loras via hypernetworks for subject-style conditioned image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16132–16142, 2025. 2

2025
[52]

Loraclr: Contrastive adaptation for customization of diffusion models

Enis Simsar, Thomas Hofmann, Federico Tombari, and Pinar Yanardag. Loraclr: Contrastive adaptation for customization of diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13189–13198,
[53]

Make-a-video: Text-to-video generation without text-video data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. InThe Eleventh International Conference on Learning Representations, 2023. 1

2023
[54]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational confer- ence on machine learning, pages 2256–2265. pmlr, 2015. 3

2015
[55]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. InInternational Conference on Learning Representations, 2021. 3

2021
[56]

arXiv preprint arXiv:2303.09522 , year=

Andrey V oynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. p+: Extended textual conditioning in text-to-image generation.arXiv preprint arXiv:2303.09522, 2023. 2

work page arXiv 2023
[57]

Characonsist: Fine- grained consistent character generation

Mengyu Wang, Henghui Ding, Jianing Peng, Yao Zhao, Yunpeng Chen, and Yunchao Wei. Characonsist: Fine- grained consistent character generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16058–16067, 2025. 1

2025
[58]

Fastcomposer: Tuning-free multi- subject image generation with localized attention.Interna- tional Journal of Computer Vision, 133(3):1175–1194, 2025

Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. Fastcomposer: Tuning-free multi- subject image generation with localized attention.Interna- tional Journal of Computer Vision, 133(3):1175–1194, 2025. 2, 3

2025
[59]

Qr-lora: Efficient and disentangled fine-tuning via qr decomposition for customized generation

Jiahui Yang, Yongjia Ma, Donglin Di, Jianxun Cui, Hao Li, Wei Chen, Yan Xie, Xun Yang, and Wangmeng Zuo. Qr-lora: Efficient and disentangled fine-tuning via qr decomposition for customized generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17587– 17597, 2025. 2

2025
[60]

Lora-composer: Leveraging low-rank adap- tation for multi-concept customization in training-free diffu- sion models.IEEE Transactions on Image Processing, 34: 8145–8158, 2025

Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Xiaofei He, et al. Lora-composer: Leveraging low-rank adap- tation for multi-concept customization in training-free diffu- sion models.IEEE Transactions on Image Processing, 34: 8145–8158, 2025. 3

2025
[61]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024. 6, 8, 17

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

Rethinking inter-lora orthogonality in adapter merging: Insights from orthogonal monte carlo dropout.arXiv preprint arXiv:2510.03262, 2025

Andi Zhang, Xuan Ding, Haofan Wang, Steven McDon- agh, and Samuel Kaski. Rethinking inter-lora orthogonality in adapter merging: Insights from orthogonal monte carlo dropout.arXiv preprint arXiv:2510.03262, 2025. 2

work page arXiv 2025
[63]

Subject or style: Adap- tive and training-free mixture of loras.arXiv preprint arXiv:2508.02165, 2025

Jia-Chen Zhang and Yu-Jie Xiong. Subject or style: Adap- tive and training-free mixture of loras.arXiv preprint arXiv:2508.02165, 2025. 2

work page arXiv 2025
[64]

S3fd: Single shot scale-invariant face detector

Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z Li. S3fd: Single shot scale-invariant face detector. InProceedings of the IEEE international conference on computer vision, pages 192–201, 2017. 14

2017
[65]

Gpt-4v (ision) as a generalist evaluator for vision-language tasks.arXiv preprint arXiv:2311.01361, 2023

Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, and Linda Ruth Petzold. Gpt-4v (ision) as a general- ist evaluator for vision-language tasks.arXiv preprint arXiv:2311.01361, 2023. 17

work page arXiv 2023
[66]

Freelora: Enabling training-free lora fusion for autoregressive multi- subject personalization.arXiv preprint arXiv:2507.01792,

Peng Zheng, Ye Wang, Rui Ma, and Zuxuan Wu. Freelora: Enabling training-free lora fusion for autoregressive multi- subject personalization.arXiv preprint arXiv:2507.01792,

work page arXiv
[67]

Decouple and orthogonalize: A data-free framework for lora merging.arXiv preprint arXiv:2505.15875, 2025

Shenghe Zheng, Hongzhi Wang, Chenyu Huang, Xiaohui Wang, Tao Chen, Jiayuan Fan, Shuyue Hu, and Peng Ye. Decouple and orthogonalize: A data-free framework for lora merging.arXiv preprint arXiv:2505.15875, 2025. 2

work page arXiv 2025
[68]

Multi- lora composition for image generation.Transactions on Ma- chine Learning Research, 2024

Ming Zhong, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, Weizhu Chen, et al. Multi- lora composition for image generation.Transactions on Ma- chine Learning Research, 2024. 1, 3, 4, 6, 7, 8, 15, 17

2024
[69]

a man” or “a woman

Xiandong Zou, Mingzhu Shen, Christos-Savvas Bouganis, and Yiren Zhao. Cached multi-lora composition for multi- concept image generation. InThe Thirteenth International Conference on Learning Representations, 2025. 1, 2, 3, 6, 7, 8, 15, 17 A. Analysis of Evaluation Metrics A.1. Limitations of Existing Metrics We identify two fundamental limitations in exis...

work page arXiv 2025
[70]

Criteria: - Visual Cohesion: Evaluate whether the elements appear as part of a unified scene, rather than as disjointed parts

Element Integration: How seamlessly different elements are combined within the image. Criteria: - Visual Cohesion: Evaluate whether the elements appear as part of a unified scene, rather than as disjointed parts. - Object Overlap and Interaction: Check for natural overlaps and interactions between objects, ensuring no awkward placements or intersections
[71]

Criteria: - Stylistic Uniformity: Ensure that all elements share a consistent artistic style (e.g., realism, cartoonish)

Spatial Consistency: Uniformity in style, lighting, and perspective across all elements. Criteria: - Stylistic Uniformity: Ensure that all elements share a consistent artistic style (e.g., realism, cartoonish). - Lighting and Shadows: Verify that light sources and shadow directions are consistent, contributing to a realistic portrayal. - Perspective Align...
[72]

Criteria: - Object Accuracy: Objects should align with their descriptions in terms of type, attributes, and context

Semantic Accuracy: Correct interpretation and representation of each element as described in the prompt. Criteria: - Object Accuracy: Objects should align with their descriptions in terms of type, attributes, and context. - Action and Interaction: Actions or interactions between objects should be depicted accurately and appropriately
[73]

method",

Aesthetic Quality: Overall visual appeal and artistic quality of the generated image. Criteria: - Color Harmony: The use of color palettes should be visually pleasing and fitting for the scene. - Composition Balance: Elements should be arranged in a balanced way to create an engaging and harmonious composition. - Clarity and Sharpness: The image should be...

[1] [1]

In- trinsic dimensionality explains the effectiveness of language model fine-tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. In- trinsic dimensionality explains the effectiveness of language model fine-tuning. InProceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 7319–7328, 2021. 3

2021

[2] [2]

Imagen 3.arXiv preprint arXiv:2408.07009, 2024

Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brich- tova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3.arXiv preprint arXiv:2408.07009, 2024. 1

work page arXiv 2024

[3] [3]

Jedi: The force of jensen-shannon divergence in disentangling dif- fusion models.arXiv preprint arXiv:2505.19166, 2025

Eric Tillmann Bill, Enis Simsar, and Thomas Hofmann. Jedi: The force of jensen-shannon divergence in disentangling dif- fusion models.arXiv preprint arXiv:2505.19166, 2025. 2

work page arXiv 2025

[4] [4]

Finding directions in gan’s latent space for neural face reenactment

Stella Bounareli, Vasileios Argyriou, and Georgios Tz- imiropoulos. Finding directions in gan’s latent space for neural face reenactment. InBritish Machine Vision Confer- ence, 2022. 14

2022

[5] [5]

How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks)

Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). InProceedings of the IEEE international conference on computer vision, pages 1021–1030, 2017. 6, 14

2017

[6] [6]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025. 6, 14

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Consislora: Enhancing content and style consistency for lora-based style transfer.arXiv preprint arXiv:2503.10614, 2025

Bolin Chen, Baoquan Zhao, Haoran Xie, Yi Cai, Qing Li, and Xudong Mao. Consislora: Enhancing content and style consistency for lora-based style transfer.arXiv preprint arXiv:2503.10614, 2025. 2

work page arXiv 2025

[8] [8]

Iteris: Iterative inference-solving alignment for lora merging

Hongxu Chen, Zhen Wang, Runshi Li, Bowei Zhu, and Long Chen. Iteris: Iterative inference-solving alignment for lora merging. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 4829–4838, 2025. 1, 2

2025

[9] [9]

Perception prior- itized training of diffusion models

Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prior- itized training of diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 11472–11481, 2022. 5, 9

2022

[10] [10]

Lo- rashop: Training-free multi-concept image generation and editing with rectified flow transformers.arXiv preprint arXiv:2505.23758, 2025

Yusuf Dalva, Hidir Yesiltepe, and Pinar Yanardag. Lo- rashop: Training-free multi-concept image generation and editing with rectified flow transformers.arXiv preprint arXiv:2505.23758, 2025. 3

work page arXiv 2025

[11] [11]

Arcface: Additive angular margin loss for deep face recog- nition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recog- nition. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 4690–4699, 2019. 2, 6, 17

2019

[12] [12]

Compositional visual generation with energy based models.Advances in Neural Information Processing Systems, 33:6637–6647, 2020

Yilun Du, Shuang Li, and Igor Mordatch. Compositional visual generation with energy based models.Advances in Neural Information Processing Systems, 33:6637–6647, 2020. 2

2020

[13] [13]

Sugarcrepe++ dataset: Vision-language model sensi- tivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018, 2024

Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, Evangelos Milios, Sageev Oore, and Hassan Sajjad. Sugarcrepe++ dataset: Vision-language model sensi- tivity to semantic and lexical alterations.Advances in Neural Information Processing Systems, 37:17972–18018, 2024. 4

2024

[14] [14]

Loratorio: An intrinsic approach to lora skill composition

Niki Foteinopoulou, Ignas Budvytis, and Stephan Liwicki. Loratorio: An intrinsic approach to lora skill composition. arXiv preprint arXiv:2508.11624, 2025. 1, 2, 3, 6, 7, 17, 20

work page arXiv 2025

[15] [15]

Implicit style-content separation using b-lora

Yarden Frenkel, Yael Vinker, Ariel Shamir, and Daniel Cohen- Or. Implicit style-content separation using b-lora. InEu- ropean Conference on Computer Vision, pages 181–198. Springer, 2024. 2

2024

[16] [16]

An image is worth one word: Personalizing text-to-image generation using textual inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. InThe Eleventh Interna- tional Conference on Learning Representations, 2023. 17

2023

[17] [17]

Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36:15890–15902, 2023

Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yun- peng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low- rank adaptation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36:15890–15902, 2023. 1, 2, 3, 6, 13

2023

[18] [18]

Clipscore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021. 6, 17

2021

[19] [19]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 3

2021

[20] [20]

Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 3

2020

[21] [21]

Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022. 1

2022

[22] [22]

Interactdiffusion: Interaction con- trol in text-to-image diffusion models

Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap-Peng Tan, and Weipeng Hu. Interactdiffusion: Interaction con- trol in text-to-image diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6180–6189, 2024. 1

2024

[23] [23]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3,

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3,

[24] [24]

Lorahub: Efficient cross-task gener- alization via dynamic lora composition

Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. Lorahub: Efficient cross-task gener- alization via dynamic lora composition. InFirst Conference on Language Modeling, 2024. 2

2024

[25] [25]

Eigenlorax: Recycling adapters to find principal sub- spaces for resource-efficient adaptation and inference

Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, and Alan Yuille. Eigenlorax: Recycling adapters to find principal sub- spaces for resource-efficient adaptation and inference. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 649–659, 2025. 2

2025

[26] [26]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Multi-concept customization of text- to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shecht- man, and Jun-Yan Zhu. Multi-concept customization of text- to-image diffusion. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 1931–1941, 2023. 2

1931

[28] [28]

Concept weaver: Enabling multi-concept fusion in text-to-image models

Gihyun Kwon, Simon Jenni, Dingzeyu Li, Joon-Young Lee, Jong Chul Ye, and Fabian Caba Heilbron. Concept weaver: Enabling multi-concept fusion in text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 8880–8889, 2024. 2

2024

[29] [29]

The double-ellipsoid geometry of CLIP

Meir Yossef Levi and Guy Gilboa. The double-ellipsoid geometry of CLIP. InForty-second International Conference on Machine Learning, 2025. 6

2025

[30] [30]

Autolora: Automatic lora retrieval and fine-grained gated fusion for text-to-image generation.arXiv preprint arXiv:2508.02107, 2025

Zhiwen Li, Zhongjie Duan, Die Chen, Cen Chen, Daoyuan Chen, Yaliang Li, and Yingda Chen. Autolora: Automatic lora retrieval and fine-grained gated fusion for text-to-image generation.arXiv preprint arXiv:2508.02107, 2025. 2

work page arXiv 2025

[31] [31]

Compositional visual generation with composable diffusion models

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. InEuropean conference on computer vision, pages 423–439. Springer, 2022. 1

2022

[32] [32]

Customizable image synthesis with multiple subjects

Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Customizable image synthesis with multiple subjects. Advances in neural information processing systems, 36:57500– 57519, 2023. 2, 3

2023

[33] [33]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in neural information processing systems, 35:5775–5787, 2022

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in neural information processing systems, 35:5775–5787, 2022. 6

2022

[34] [34]

Contrastive test-time composition of multiple lora models for image generation

Tuna Han Salih Meral, Enis Simsar, Federico Tombari, and Pinar Yanardag. Contrastive test-time composition of multiple lora models for image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18090–18100, 2025. 3

2025

[35] [35]

Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on

Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Mar- cella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. In Proceedings of the 31st ACM international conference on multimedia, pages 8580–8589, 2023. 1

2023

[36] [36]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patrick...

[37] [37]

K-lora: Unlock- ing training-free fusion of any subject and style loras

Ziheng Ouyang, Zhen Li, and Qibin Hou. K-lora: Unlock- ing training-free fusion of any subject and style loras. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13041–13050, 2025. 2

2025

[38] [38]

Understanding the latent space of diffu- sion models through the lens of riemannian geometry.Ad- vances in Neural Information Processing Systems, 36:24129– 24142, 2023

Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffu- sion models through the lens of riemannian geometry.Ad- vances in Neural Information Processing Systems, 36:24129– 24142, 2023. 5, 9

2023

[39] [39]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

[40] [40]

Tara: Token-aware lora for composable personalization in diffusion models

Yuqi Peng, Lingtao Zheng, Yufeng Yang, Yi Huang, Mingfu Yan, Jianzhuang Liu, and Shifeng Chen. Tara: Token-aware lora for composable personalization in diffusion models. arXiv preprint arXiv:2508.08812, 2025. 2

work page arXiv 2025

[41] [41]

Orthogonal adaptation for modular customization of diffusion models

Ryan Po, Guandao Yang, Kfir Aberman, and Gordon Wet- zstein. Orthogonal adaptation for modular customization of diffusion models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 7964–7973, 2024. 1, 2, 6, 13

2024

[42] [42]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InThe Twelfth International Conference on Learning Representations, 2024. 1

2024

[43] [43]

Lora soups: Merg- ing loras for practical skill composition tasks

Akshara Prabhakar, Yuanzhi Li, Karthik Narasimhan, Sham Kakade, Eran Malach, and Samy Jelassi. Lora soups: Merg- ing loras for practical skill composition tasks. InProceedings of the 31st International Conference on Computational Lin- guistics: Industry Track, pages 644–655, 2025. 2

2025

[44] [44]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 6

2021

[45] [45]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. InInternational confer- ence on machine learning, pages 8821–8831. Pmlr, 2021. 1

2021

[46] [46]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 3, 6

2022

[47] [47]

Multlfg: Training-free multi-lora composi- tion using frequency-domain guidance.arXiv preprint arXiv:2505.20525, 2025

Aniket Roy, Maitreya Suin, Ketul Shah, and Rama Chellappa. Multlfg: Training-free multi-lora composi- tion using frequency-domain guidance.arXiv preprint arXiv:2505.20525, 2025. 1, 2, 3

work page arXiv 2025

[48] [48]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven gen- eration

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven gen- eration. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 22500–22510,

[49] [49]

Pho- torealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Pho- torealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022. 1

2022

[50] [50]

Ziplora: Any subject in any style by effectively merging loras

Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, and Varun Jampani. Ziplora: Any subject in any style by effectively merging loras. InEuropean Conference on Computer Vision, pages 422–438. Springer,

[51] [51]

Donald Shenaj, Ondrej Bohdal, Mete Ozay, Pietro Zanuttigh, and Umberto Michieli. Lora. rar: Learning to merge loras via hypernetworks for subject-style conditioned image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16132–16142, 2025. 2

2025

[52] [52]

Loraclr: Contrastive adaptation for customization of diffusion models

Enis Simsar, Thomas Hofmann, Federico Tombari, and Pinar Yanardag. Loraclr: Contrastive adaptation for customization of diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13189–13198,

[53] [53]

Make-a-video: Text-to-video generation without text-video data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. InThe Eleventh International Conference on Learning Representations, 2023. 1

2023

[54] [54]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational confer- ence on machine learning, pages 2256–2265. pmlr, 2015. 3

2015

[55] [55]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. InInternational Conference on Learning Representations, 2021. 3

2021

[56] [56]

arXiv preprint arXiv:2303.09522 , year=

Andrey V oynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. p+: Extended textual conditioning in text-to-image generation.arXiv preprint arXiv:2303.09522, 2023. 2

work page arXiv 2023

[57] [57]

Characonsist: Fine- grained consistent character generation

Mengyu Wang, Henghui Ding, Jianing Peng, Yao Zhao, Yunpeng Chen, and Yunchao Wei. Characonsist: Fine- grained consistent character generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16058–16067, 2025. 1

2025

[58] [58]

Fastcomposer: Tuning-free multi- subject image generation with localized attention.Interna- tional Journal of Computer Vision, 133(3):1175–1194, 2025

Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. Fastcomposer: Tuning-free multi- subject image generation with localized attention.Interna- tional Journal of Computer Vision, 133(3):1175–1194, 2025. 2, 3

2025

[59] [59]

Qr-lora: Efficient and disentangled fine-tuning via qr decomposition for customized generation

Jiahui Yang, Yongjia Ma, Donglin Di, Jianxun Cui, Hao Li, Wei Chen, Yan Xie, Xun Yang, and Wangmeng Zuo. Qr-lora: Efficient and disentangled fine-tuning via qr decomposition for customized generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17587– 17597, 2025. 2

2025

[60] [60]

Lora-composer: Leveraging low-rank adap- tation for multi-concept customization in training-free diffu- sion models.IEEE Transactions on Image Processing, 34: 8145–8158, 2025

Yang Yang, Wen Wang, Liang Peng, Chaotian Song, Yao Chen, Hengjia Li, Xiaolong Yang, Qinglin Lu, Deng Cai, Xiaofei He, et al. Lora-composer: Leveraging low-rank adap- tation for multi-concept customization in training-free diffu- sion models.IEEE Transactions on Image Processing, 34: 8145–8158, 2025. 3

2025

[61] [61]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024. 6, 8, 17

work page internal anchor Pith review Pith/arXiv arXiv 2024

[62] [62]

Rethinking inter-lora orthogonality in adapter merging: Insights from orthogonal monte carlo dropout.arXiv preprint arXiv:2510.03262, 2025

Andi Zhang, Xuan Ding, Haofan Wang, Steven McDon- agh, and Samuel Kaski. Rethinking inter-lora orthogonality in adapter merging: Insights from orthogonal monte carlo dropout.arXiv preprint arXiv:2510.03262, 2025. 2

work page arXiv 2025

[63] [63]

Subject or style: Adap- tive and training-free mixture of loras.arXiv preprint arXiv:2508.02165, 2025

Jia-Chen Zhang and Yu-Jie Xiong. Subject or style: Adap- tive and training-free mixture of loras.arXiv preprint arXiv:2508.02165, 2025. 2

work page arXiv 2025

[64] [64]

S3fd: Single shot scale-invariant face detector

Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z Li. S3fd: Single shot scale-invariant face detector. InProceedings of the IEEE international conference on computer vision, pages 192–201, 2017. 14

2017

[65] [65]

Gpt-4v (ision) as a generalist evaluator for vision-language tasks.arXiv preprint arXiv:2311.01361, 2023

Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, and Linda Ruth Petzold. Gpt-4v (ision) as a general- ist evaluator for vision-language tasks.arXiv preprint arXiv:2311.01361, 2023. 17

work page arXiv 2023

[66] [66]

Freelora: Enabling training-free lora fusion for autoregressive multi- subject personalization.arXiv preprint arXiv:2507.01792,

Peng Zheng, Ye Wang, Rui Ma, and Zuxuan Wu. Freelora: Enabling training-free lora fusion for autoregressive multi- subject personalization.arXiv preprint arXiv:2507.01792,

work page arXiv

[67] [67]

Decouple and orthogonalize: A data-free framework for lora merging.arXiv preprint arXiv:2505.15875, 2025

Shenghe Zheng, Hongzhi Wang, Chenyu Huang, Xiaohui Wang, Tao Chen, Jiayuan Fan, Shuyue Hu, and Peng Ye. Decouple and orthogonalize: A data-free framework for lora merging.arXiv preprint arXiv:2505.15875, 2025. 2

work page arXiv 2025

[68] [68]

Multi- lora composition for image generation.Transactions on Ma- chine Learning Research, 2024

Ming Zhong, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, Weizhu Chen, et al. Multi- lora composition for image generation.Transactions on Ma- chine Learning Research, 2024. 1, 3, 4, 6, 7, 8, 15, 17

2024

[69] [69]

a man” or “a woman

Xiandong Zou, Mingzhu Shen, Christos-Savvas Bouganis, and Yiren Zhao. Cached multi-lora composition for multi- concept image generation. InThe Thirteenth International Conference on Learning Representations, 2025. 1, 2, 3, 6, 7, 8, 15, 17 A. Analysis of Evaluation Metrics A.1. Limitations of Existing Metrics We identify two fundamental limitations in exis...

work page arXiv 2025

[70] [70]

Criteria: - Visual Cohesion: Evaluate whether the elements appear as part of a unified scene, rather than as disjointed parts

Element Integration: How seamlessly different elements are combined within the image. Criteria: - Visual Cohesion: Evaluate whether the elements appear as part of a unified scene, rather than as disjointed parts. - Object Overlap and Interaction: Check for natural overlaps and interactions between objects, ensuring no awkward placements or intersections

[71] [71]

Criteria: - Stylistic Uniformity: Ensure that all elements share a consistent artistic style (e.g., realism, cartoonish)

Spatial Consistency: Uniformity in style, lighting, and perspective across all elements. Criteria: - Stylistic Uniformity: Ensure that all elements share a consistent artistic style (e.g., realism, cartoonish). - Lighting and Shadows: Verify that light sources and shadow directions are consistent, contributing to a realistic portrayal. - Perspective Align...

[72] [72]

Criteria: - Object Accuracy: Objects should align with their descriptions in terms of type, attributes, and context

Semantic Accuracy: Correct interpretation and representation of each element as described in the prompt. Criteria: - Object Accuracy: Objects should align with their descriptions in terms of type, attributes, and context. - Action and Interaction: Actions or interactions between objects should be depicted accurately and appropriately

[73] [73]

method",

Aesthetic Quality: Overall visual appeal and artistic quality of the generated image. Criteria: - Color Harmony: The use of color palettes should be visually pleasing and fitting for the scene. - Composition Balance: Elements should be arranged in a balanced way to create an engaging and harmonious composition. - Clarity and Sharpness: The image should be...