Gate-and-Merge: Zero-shot Compositional Personalization of Vision Language Models

Angela Yao; Guodong Ding

arxiv: 2605.08702 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-12 00:51 UTC · model grok-4.3

keywords: compositional personalization, vision-language models, LoRA adapters, zero-shot learning, model merging, gating mechanism, personalization

The pith

Gate-and-Merge achieves zero-shot compositional personalization of vision-language models by training each concept as an independent LoRA adapter, then merging and gating the adapters at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of making vision-language models recognize and describe multiple user-specified concepts together at test time, without any training examples that show those concepts co-occurring. It proposes learning each concept separately as a lightweight LoRA adapter paired with its own token, leaving the base model and the concepts disentangled. At inference the adapters are merged directly in weight space while a gating step uses textual and visual cues to activate only the relevant modules and discard inconsistent updates. Experiments demonstrate consistent improvements over baselines on both single-concept and multi-concept personalization benchmarks.

Core claim

Compositional personalization is possible in a zero-shot manner: each concept is captured by its own LoRA adapter and token during independent personalization; at inference the adapters are merged in parameter space and a gating function selects only the modules whose textual and visual activations are mutually consistent, thereby suppressing interference while preserving each concept's identity.
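
One way to write the merge described above, in standard LoRA notation. This is a sketch under assumptions: the summary gives no explicit formula, so the binary gate $g_i$, the shared rank $r$, and the scaling factor $\alpha$ are illustrative symbols rather than the paper's own notation.

```latex
W' = W_0 + \sum_{i=1}^{K} g_i \, \frac{\alpha}{r} \, B_i A_i,
\qquad g_i \in \{0, 1\},\quad
B_i \in \mathbb{R}^{d \times r},\quad
A_i \in \mathbb{R}^{r \times k}
```

Here $W_0$ is the frozen base weight and $B_i A_i$ is the low-rank update learned for concept $i$. Setting $g_i = 1$ for exactly one concept recovers the single-concept model; setting several gates to 1 gives the composed model.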

What carries the argument

The Gate-and-Merge operation that merges concept-specific LoRA weight deltas in parameter space and applies a cue-based gate to select and combine only the most relevant and consistent updates.
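
The operation can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `gate_and_merge`, the binary `gates` list, and the `alpha / rank` scaling (the usual LoRA convention) are assumptions, and the gate values are taken as given rather than computed from cues.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product: a is (n x m), b is (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gate_and_merge(w0: Matrix,
                   adapters: Sequence[Tuple[Matrix, Matrix]],
                   gates: Sequence[int],
                   alpha: float,
                   rank: int) -> Matrix:
    """Return w0 + sum_i gates[i] * (alpha / rank) * B_i @ A_i (hypothetical helper)."""
    merged = [row[:] for row in w0]  # copy, so the base weights stay untouched
    scale = alpha / rank
    for (b, a), gate in zip(adapters, gates):
        if not gate:
            continue  # a gated-off adapter contributes nothing
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged


# Toy example: 2x2 base weight and two rank-1 adapters (B is 2x1, A is 1x2).
w0 = [[1.0, 0.0], [0.0, 1.0]]
adapter_a = ([[1.0], [0.0]], [[0.5, 0.0]])  # only touches entry (0, 0)
adapter_b = ([[0.0], [1.0]], [[0.0, 2.0]])  # only touches entry (1, 1)

merged = gate_and_merge(w0, [adapter_a, adapter_b], gates=[1, 0], alpha=2.0, rank=1)
print(merged)  # [[2.0, 0.0], [0.0, 1.0]]
```

With `gates=[1, 0]` only the first adapter's delta is added; the second adapter leaves the merged weight untouched, which is the behaviour the gate is meant to provide.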

If this is right

Performance improves on both single-concept and compositional personalization tasks without any co-occurrence training data.
Concepts remain disentangled because only the base model is shared and each adapter is trained in isolation.
Only mutually consistent adapters are combined, which stabilizes the joint prediction.
The approach applies to multiple downstream personalization tasks including recognition and description.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same independent-adapter-plus-merge pattern could be tested on other parameter-efficient modules beyond LoRA.
Dynamic addition of new concepts becomes feasible because no joint retraining is required when a fresh adapter is introduced.
The gating logic might be extended to decide not only which adapters to include but also their relative strengths based on scene content.

Load-bearing premise

Independently trained LoRA adapters for separate concepts can be merged in weight space at inference time, guided by a gating mechanism, without causing significant interference or loss of each concept's distinct identity.
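
One crude way to probe this premise is to check how much two adapters' weight deltas overlap. The sketch below computes the cosine similarity of two deltas under the Frobenius inner product. The function name and the toy matrices are hypothetical, and a value near zero is only suggestive: interference can still arise through downstream layers even when the deltas themselves do not overlap.

```python
import math
from typing import List

Matrix = List[List[float]]


def frobenius_cosine(d1: Matrix, d2: Matrix) -> float:
    """Cosine similarity between two same-shaped weight deltas (hypothetical helper)."""
    dot = sum(x * y for r1, r2 in zip(d1, d2) for x, y in zip(r1, r2))
    n1 = math.sqrt(sum(x * x for row in d1 for x in row))
    n2 = math.sqrt(sum(x * x for row in d2 for x in row))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # an all-zero delta overlaps with nothing
    return dot / (n1 * n2)


# Toy deltas: a and b touch disjoint entries, c overlaps with a.
delta_a = [[1.0, 0.0], [0.0, 0.0]]
delta_b = [[0.0, 0.0], [0.0, 2.0]]
delta_c = [[3.0, 0.0], [0.0, 4.0]]

overlap_ab = frobenius_cosine(delta_a, delta_b)  # disjoint support
overlap_ac = frobenius_cosine(delta_a, delta_c)  # partial overlap
print(overlap_ab, overlap_ac)
```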

What would settle it

A test set in which merged outputs systematically mix attributes across concepts or fail to describe joint scenes that the individual adapters handled correctly.

Figures

Figures reproduced from arXiv: 2605.08702 by Angela Yao, Guodong Ding.

**Figure 2.** The proposed Gate-and-Merge framework. (a) During personalization, each concept is independently learned as a LoRA module

**Figure 3.** Qualitative comparison of personalized captioning capabilities across different models. From top to bottom, each row shows

Original abstract

This paper tackles compositional personalization of vision-language models (VLMs). In this problem, multiple user-defined concepts must be recognized or described jointly at test time. We introduce Gate-and-Merge, a zero-shot framework that enables compositional personalization without the need for co-occurrence training. During personalization, each concept is learned independently as a lightweight LoRA adapter, paired with a concept token. The base model remains unchanged and concepts are kept disentangled. At inference, we enable composition by merging concept-specific LoRA updates directly in weight space. To suppress irrelevant activations and prevent interference, a gating mechanism is employed to estimate textual and visual cues and select only the modules that contribute to the prediction. We further stabilize composition by combining only the most meaningful and mutually consistent updates, helping preserve each concept's identity. Our quantitative and qualitative analyses show consistent gains in performance across multiple personalization tasks in both single-concept and compositional settings.
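
The abstract does not say how "the most meaningful and mutually consistent updates" are chosen. One plausible reading, shown here purely as an assumption, is a TIES-merging-style rule (Yadav et al., 2023): keep only the largest-magnitude entries of each delta, elect a sign per position, and average the entries that agree with it. The function names and toy vectors below are hypothetical, and the paper's actual rule may differ.

```python
from typing import List


def trim_top_k(delta: List[float], k: int) -> List[float]:
    """Keep the k largest-magnitude entries of a flattened delta, zero the rest."""
    if k <= 0:
        return [0.0] * len(delta)
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(delta)]


def sign_consistent_merge(deltas: List[List[float]], k: int) -> List[float]:
    """TIES-style merge: trim, elect a sign per position, average agreeing entries."""
    trimmed = [trim_top_k(d, k) for d in deltas]
    merged = []
    for column in zip(*trimmed):
        total = sum(column)
        sign = 1.0 if total > 0 else -1.0 if total < 0 else 0.0
        agreeing = [v for v in column if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Toy flattened deltas for two concepts (illustrative values only).
delta_a = [0.9, -0.1, 0.4, 0.0]
delta_b = [0.8, 0.05, -0.6, 0.3]

merged = sign_consistent_merge([delta_a, delta_b], k=2)
print(merged)
```

In this toy case the first position is kept and averaged because both deltas agree in sign, while the third position keeps only the entry matching the elected (negative) sign and discards the conflicting one.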

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Gate-and-Merge combines independent LoRA training with weight-space merging and cue-based gating for zero-shot VLM composition, but the no-interference claim needs stronger experimental backing than the abstract suggests.

The paper's main contribution is a zero-shot framework for composing multiple user-defined concepts in VLMs. Each concept gets its own LoRA adapter trained independently on separate data, then at inference the adapters are merged directly in weight space while a gating step uses textual and visual cues to pick only the relevant updates and drop the rest. This avoids any need for co-occurrence examples during training and leaves the base model frozen.

That setup is new relative to prior personalization work that either requires joint data or does not merge in this way. It is practically useful when collecting images showing concepts together is difficult, and the disentangled training phase keeps things modular. The authors report consistent gains on both single-concept and compositional tasks, which would be a real step forward if the numbers hold up across datasets and baselines.

The soft spot is the assumption that merging independent LoRAs produces limited interference in the shared attention and feed-forward layers. The abstract mentions selecting “most meaningful and mutually consistent updates,” but without an explicit merging formula, a consistency metric, or ablations that measure per-concept accuracy drop under composition, it is unclear how much cross-concept leakage occurs through vision-language cross-attention. If the gating is imperfect, compositional performance could fall below the single-concept baseline. The full paper presumably includes quantitative results, error bars, and comparisons, but those details are what determine whether the central claim is supported.

This work is aimed at researchers doing efficient adapter-based customization of VLMs. A reader already working on LoRA merging or personalization would find the framework worth examining even if they end up changing the gating or merging rule. I would send it to peer review so the experiments can be checked for proper controls on interference and stronger baselines.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Gate-and-Merge, a zero-shot framework for compositional personalization of vision-language models. Each user-defined concept is learned independently via a lightweight LoRA adapter paired with a concept token, leaving the base VLM unchanged. At inference, concept-specific LoRA updates are merged directly in weight space, with a gating mechanism that uses textual and visual cues to select only the most relevant and mutually consistent modules, thereby suppressing interference and preserving individual concept identity without requiring co-occurrence training data. The authors report consistent performance gains across multiple personalization tasks in both single-concept and compositional settings, supported by quantitative and qualitative analyses.

Significance. If the weight-space merging and gating mechanism can be shown to reliably avoid cross-concept interference while preserving identity, the approach would represent a meaningful engineering advance for modular, zero-shot personalization of VLMs. It sidesteps the data-collection burden of joint training and could enable more flexible user adaptation in vision-language tasks. The emphasis on disentangled adapters and inference-time composition is a strength, though its practical impact hinges on empirical validation of orthogonality assumptions in shared VLM layers.

major comments (3)

[§3.2] Method, merging procedure: the description of selecting 'most meaningful and mutually consistent updates' lacks an explicit formula, consistency metric, or interference measure (e.g., per-concept accuracy drop under joint activation). Without this, it is impossible to verify that the claimed suppression of cross-concept activations holds in practice.
[§4] Experiments: the abstract and results claim 'consistent gains' across single-concept and compositional settings, yet no quantitative numbers, baselines, error bars, dataset details, or ablation on the gating threshold are supplied. This prevents assessment of whether the central claim of zero-shot composition without interference is supported by the data.
[§3.3] Gating mechanism: the textual/visual cue-based gate is presented as preventing interference, but no analysis of residual coupling through vision-language cross-attention layers or failure cases under composition is provided. This is load-bearing for the orthogonality assumption underlying weight-space merging.
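
The interference measure requested in the first comment could be as simple as the following sketch: for each concept, subtract accuracy under joint activation from accuracy with the adapter used in isolation. The function name, concept tokens, and accuracy values are placeholders for illustration, not results from the paper.

```python
from typing import Dict


def interference_drop(isolated: Dict[str, float],
                      joint: Dict[str, float]) -> Dict[str, float]:
    """Per-concept accuracy drop: isolated accuracy minus accuracy under composition."""
    mismatched = set(isolated) ^ set(joint)
    if mismatched:
        raise ValueError(f"concepts missing from one setting: {sorted(mismatched)}")
    return {concept: isolated[concept] - joint[concept] for concept in isolated}


# Placeholder accuracies (NOT reported numbers).
isolated_acc = {"<concept_a>": 0.75, "<concept_b>": 0.875}
joint_acc = {"<concept_a>": 0.5, "<concept_b>": 0.875}

drops = interference_drop(isolated_acc, joint_acc)
worst = max(drops, key=drops.get)  # concept hurt most by composition
print(drops, worst)
```

A drop near zero for every concept would support the no-interference claim; a large drop for any concept would point to the leakage the referee is worried about.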

minor comments (2)

[§3] Notation for LoRA rank, scaling factor, and gating threshold should be defined explicitly in the method section and used consistently in equations.
[§4.3] Figure captions for qualitative examples should include the exact prompts and concept tokens used to allow reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the paper to improve clarity, add missing details, and strengthen empirical support where appropriate.

Referee: [§3.2] Method, merging procedure: the description of selecting 'most meaningful and mutually consistent updates' lacks an explicit formula, consistency metric, or interference measure (e.g., per-concept accuracy drop under joint activation). Without this, it is impossible to verify that the claimed suppression of cross-concept activations holds in practice.

Authors: We agree that the merging procedure requires a more explicit formulation. In the revised manuscript, we have added the precise mathematical definition of the weight-space merging operation in §3.2, including the consistency metric derived from cue alignment scores. We also report an interference measure by comparing per-concept accuracy under joint versus isolated adapter activation. These additions allow direct verification of interference suppression.

revision: yes

Referee: [§4] Experiments: the abstract and results claim 'consistent gains' across single-concept and compositional settings, yet no quantitative numbers, baselines, error bars, dataset details, or ablation on the gating threshold are supplied. This prevents assessment of whether the central claim of zero-shot composition without interference is supported by the data.

Authors: We have revised §4 to include the specific quantitative metrics, baseline comparisons, error bars, full dataset specifications, and a dedicated ablation on the gating threshold. These additions provide the necessary evidence to evaluate the performance gains and the zero-shot composition claims.

revision: yes

Referee: [§3.3] Gating mechanism: the textual/visual cue-based gate is presented as preventing interference, but no analysis of residual coupling through vision-language cross-attention layers or failure cases under composition is provided. This is load-bearing for the orthogonality assumption underlying weight-space merging.

Authors: We acknowledge the need for deeper analysis of the gating mechanism. The revised §3.3 now includes a discussion of residual coupling through cross-attention layers and presents qualitative failure cases under multi-concept composition. These additions clarify the practical limits of the orthogonality assumption while supporting the gating approach with empirical observations from our experiments.

revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper introduces Gate-and-Merge as a practical engineering framework for zero-shot compositional personalization via independent LoRA adapters and a gating mechanism. No equations, derivations, or first-principles results are presented that reduce claimed performance gains to quantities defined by the method's own fitted inputs or self-citations. The approach is described through independent training of concept-specific adapters followed by inference-time merging and gating, with gains validated via external quantitative and qualitative analyses on personalization tasks. This keeps the central claims self-contained and falsifiable against benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions from parameter-efficient fine-tuning literature and adapter merging techniques; no new entities are postulated.

free parameters (2)

LoRA rank and scaling factor
Standard hyperparameters for each concept adapter that must be chosen or tuned.
Gating threshold or selection criteria
Parameters controlling which adapters are activated based on estimated cues.
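
To make the second free parameter concrete, here is a minimal sketch of a threshold gate. It assumes each concept already has a textual and a visual cue score in [0, 1] and activates an adapter only when both reach the threshold; the function name, the scores, the concept tokens, and the "both cues must agree" rule are illustrative assumptions, since the summary does not specify how the gate is computed.

```python
from typing import Dict, List


def select_adapters(text_scores: Dict[str, float],
                    visual_scores: Dict[str, float],
                    threshold: float) -> List[str]:
    """Keep concepts whose textual AND visual cue scores both reach the threshold."""
    return [concept for concept, t_score in text_scores.items()
            if t_score >= threshold
            and visual_scores.get(concept, 0.0) >= threshold]


# Hypothetical cue scores for three personalized concept tokens.
text_scores = {"<dog>": 0.9, "<mug>": 0.8, "<car>": 0.1}
visual_scores = {"<dog>": 0.7, "<mug>": 0.2, "<car>": 0.6}

active = select_adapters(text_scores, visual_scores, threshold=0.5)
print(active)  # ['<dog>']
```

Only `<dog>` passes because it is the only concept supported by both cues; `<mug>` is mentioned in text but not seen, and `<car>` is seen but not asked about. Sweeping `threshold` is the ablation the referee asks for in §4.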

axioms (2)

domain assumption: LoRA updates for different concepts remain sufficiently disentangled when merged in weight space
Invoked in the inference-time merging step to preserve concept identity.
domain assumption: Textual and visual cues can reliably estimate which concept modules are relevant
Central to the gating mechanism described in the abstract.
