
arxiv: 2511.11051 · v3 · pith:OHWMMWVSnew · submitted 2025-11-14 · 💻 cs.CV

NP-LoRA: Null Space Projection for Subject-Style LoRA Fusion

Pith reviewed 2026-05-25 07:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords LoRA fusion · null space projection · subject style composition · diffusion model adaptation · training-free merging · parameter subspace interference · low-rank adapter composition

The pith

Projecting the content LoRA onto the orthogonal complement of the style LoRA's principal subspace suppresses interference during fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a geometric reformulation of LoRA fusion as the control of overlapping low-rank subspaces rather than simple weight averaging. It introduces NP-LoRA, which projects the content LoRA onto the null space defined by the principal directions of the style LoRA to reduce conflicting updates. A soft version of this projection is derived as the closed-form solution to a regularized optimization that trades off suppression against content preservation. Experiments on multiple pretrained LoRA pairs demonstrate improved balance in subject-style image generation without any additional training. The approach treats fusion as an explicit modulation of cross-subspace interactions instead of post-hoc merging.

Core claim

NP-LoRA defines a projection operator that maps the content LoRA into the orthogonal complement of the dominant directions of the style LoRA, thereby attenuating parameter conflicts along those directions while retaining complementary content information; the soft variant interpolates continuously between ordinary linear merging and strict null-space projection via a single regularization parameter.
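One way to write the two operators down is sketched here. This is an illustration under stated assumptions, not the paper's own notation: the symbols ΔW_c and ΔW_s (content and style weight updates), V_k (top-k right singular vectors of ΔW_s), μ ≥ 0 (regularization weight), and the specific quadratic objective are all assumed. The summary only states that a closed-form soft projection exists and that it interpolates between merging and hard projection.

```latex
\begin{aligned}
P &= V_k V_k^{\top}, \\
\Delta W_c^{\mathrm{hard}} &= \Delta W_c\,(I - P), \\
\Delta W_c^{\mathrm{soft}}(\mu)
  &= \arg\min_{X}\; \lVert X - \Delta W_c \rVert_F^2 + \mu\,\lVert X V_k \rVert_F^2 \\
  &= \Delta W_c\,(I + \mu P)^{-1}
   = \Delta W_c \Bigl(I - \tfrac{\mu}{1+\mu}\,P\Bigr).
\end{aligned}
```

The last equality uses P² = P. Under this assumed objective, μ = 0 leaves the content update untouched (ordinary linear merging once ΔW_s is added), and μ → ∞ recovers the hard projection.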

What carries the argument

A null-space projection operator that maps the content LoRA matrix onto the orthogonal complement of the principal subspace spanned by the style LoRA's top singular vectors.
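A minimal NumPy sketch of such an operator follows. It assumes the style subspace is taken from the right singular vectors of the style update (Figure 8's caption mentions both a U-space and a V-space variant), that k equals the LoRA rank, and that fusion is a plain sum. The function names and toy matrices are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def style_basis(delta_w_style, k):
    """Return the top-k right singular vectors of the style update as columns."""
    _, _, vt = np.linalg.svd(delta_w_style, full_matrices=False)
    return vt[:k].T  # shape (d_in, k), orthonormal columns

def hard_project(delta_w_content, v_k):
    """Project the content update onto the orthogonal complement of span(v_k)."""
    return delta_w_content - (delta_w_content @ v_k) @ v_k.T

# Toy rank-4 updates standing in for real LoRA products B @ A (hypothetical data).
rng = np.random.default_rng(0)
d_out, d_in, rank = 32, 24, 4
delta_w_style = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))
delta_w_content = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))

v_k = style_basis(delta_w_style, k=rank)
projected_content = hard_project(delta_w_content, v_k)
merged = delta_w_style + projected_content  # assumed fusion rule: plain sum

# Component of the projected content that still lies along style directions.
leak = np.linalg.norm(projected_content @ v_k)
print(f"style-direction leak after projection: {leak:.2e}")
```

The projection is applied as two thin matrix products rather than by forming the full d_in × d_in projector, which keeps the cost proportional to k.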

If this is right

  • Fusion becomes a controllable geometric operation rather than an empirical averaging step.
  • A single scalar parameter governs the strength of style-subspace suppression.
  • The method requires no retraining or additional data for each new LoRA pair.
  • Content information outside the style principal subspace is preserved by construction.
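The single-scalar control in the list above can be sketched as follows. The sketch assumes a Tikhonov-style objective, minimizing ‖X − ΔW_c‖² + μ‖X V_k‖² over X, whose closed form scales the style-aligned part of the content update by 1/(1 + μ). The objective, the name `mu`, and the toy data are assumptions for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

def soft_project(delta_w_content, v_k, mu):
    """Scale the style-aligned part of the content update by 1 / (1 + mu)."""
    if mu < 0:
        raise ValueError("mu must be non-negative")
    strength = mu / (1.0 + mu)  # 0 at mu = 0, approaches 1 as mu grows
    return delta_w_content - strength * (delta_w_content @ v_k) @ v_k.T

# Toy rank-4 updates standing in for real LoRA products (hypothetical data).
rng = np.random.default_rng(1)
d_out, d_in, rank = 32, 24, 4
delta_w_style = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))
delta_w_content = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))

# Style subspace: top right singular vectors of the style update (assumed choice).
_, _, vt = np.linalg.svd(delta_w_style, full_matrices=False)
v_k = vt[:rank].T

in_style_before = np.linalg.norm(delta_w_content @ v_k)
for mu in (0.0, 1.0, 10.0, 1e6):
    softened = soft_project(delta_w_content, v_k, mu)
    ratio = np.linalg.norm(softened @ v_k) / in_style_before
    print(f"mu={mu:g}: remaining style-aligned fraction = {ratio:.4f}")
```

Content components orthogonal to `v_k` pass through unchanged for every value of `mu`, which is the "preserved by construction" property in the last bullet.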

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection logic could be applied in reverse, projecting style onto the null space of content, to test symmetry of the interference.
  • If the subspaces of multiple styles are known, successive projections might allow controlled multi-style composition.
  • The closed-form solution suggests the method could extend to other low-rank adapters beyond diffusion models.
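The multi-style extension above can be made concrete with a joint basis. Projections onto the complements of two non-orthogonal subspaces do not commute, so applying them one after another can leave residue in the first style's directions. The sketch below instead orthonormalizes the stacked style bases and projects once. This is a hypothetical construction for illustration, not something the paper proposes, and all names and data are assumed.

```python
import numpy as np

def right_basis(delta_w, k):
    """Top-k right singular vectors of an update, as columns."""
    _, _, vt = np.linalg.svd(delta_w, full_matrices=False)
    return vt[:k].T

def joint_style_basis(style_updates, k, tol=1e-10):
    """Orthonormal basis for the span of every style's top-k directions."""
    stacked = np.concatenate([right_basis(dw, k) for dw in style_updates], axis=1)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, s > tol]  # drop numerically redundant directions

def project_out(delta_w_content, basis):
    """Remove the content components that lie in span(basis)."""
    return delta_w_content - (delta_w_content @ basis) @ basis.T

rng = np.random.default_rng(2)
d_out, d_in, rank = 32, 24, 4

def toy_update():
    # Hypothetical stand-in for a LoRA product B @ A.
    return rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))

style_a, style_b, content = toy_update(), toy_update(), toy_update()

basis = joint_style_basis([style_a, style_b], k=rank)
cleaned = project_out(content, basis)
leak_a = np.linalg.norm(cleaned @ right_basis(style_a, rank))
leak_b = np.linalg.norm(cleaned @ right_basis(style_b, rank))
print(f"leak into style A: {leak_a:.2e}, leak into style B: {leak_b:.2e}")
```

Each added style removes up to k more directions from the content update, so the content that survives shrinks as styles accumulate.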

Load-bearing premise

The principal directions extracted from the style LoRA capture the main directions of interference with content updates.

What would settle it

Generate images from the same subject-style LoRA pair using the hard projection, the soft projection at multiple regularization values, and standard merging, then measure whether subject fidelity drops sharply when the projection strength increases.

Figures

Figures reproduced from arXiv: 2511.11051 by Chuheng Chen, Geyuan Zhang, Xiaofei Zhou, Yong Huang.

Figure 1: Illustration of our motivation. (a) The task is to compose a …
Figure 2: Overview of the proposed method. NP-LoRA takes pretrained content and style LoRAs as inputs. The style LoRA is decomposed via singular value …
Figure 3: Singular value spectrum of a LoRA and perturbation effects. We …
Figure 4: (a) and (e) are the content and style references, respectively. (c) shows …
Figure 5: Visualization of our method. (a) Content image. (b) Style image.
Figure 6: Qualitative comparison with Direct Weight Merge, B-LoRA, ZipLoRA, K-LoRA, LoRA.rar, and our proposed NP-LoRA, illustrating the trade-off …
Figure 7: Comparison of LoRA Projection. (a) Content LoRA training images, …
Figure 8: Comparison of output-space (U) and parameter-space (V) projections for null-space construction. The U-space projection fails to remove style …
Figure 9: Qualitative comparison with joint training. Joint training exhibits unstable performance and often fails to merge content and style effectively, while …
Figure 10: Our method effectively modifies the object’s actions and environment while maintaining the original style.
Figure 11: Results obtained with randomly selected seeds demonstrate the stability and robustness of our NP-LoRA.
Figure 12: Qualitative results of NP-LoRA on the Flux backbone using diverse publicly available LoRAs. Each image corresponds to the combination of the …
Figure 13: Qualitative results of NP-LoRA on the Flux backbone using diverse publicly available LoRAs. Each image corresponds to the combination of the …
Original abstract

Low-Rank Adaptation (LoRA) fusion enables the composition of subject and style representations for controllable generation without retraining. However, existing approaches primarily operate through weight-level merging, without explicitly modeling how independently trained LoRAs interact in the shared parameter space. We adopt a geometric perspective on LoRA fusion, interpreting content and style LoRAs as occupying overlapping, non-orthogonal low-rank subspaces, where such overlap can lead to conflicting parameter updates that affect generation quality. This observation motivates us to reformulate LoRA fusion not merely as parameter combination, but as a problem of controlling how updates from overlapping subspaces are combined. Based on this insight, we propose Null Space Projection LoRA (NP-LoRA), a training-free framework that employs projection as a fusion operator to explicitly modulate cross-LoRA interactions. Specifically, NP-LoRA uses principal directions of the style LoRA to define a projection subspace and projects the content LoRA onto the complementary subspace (i.e., the null space of the style LoRA), suppressing interference along dominant style directions while preserving complementary information. To avoid the overly aggressive suppression of hard projection, we further formulate soft projection as a regularized optimization problem that balances content preservation against style-subspace suppression. This objective admits a closed-form solution, yielding a projection operator controlled by a single parameter that continuously interpolates between linear merging and hard projection. Extensive experiments across multiple pretrained LoRA pairs show that NP-LoRA achieves more balanced content-style composition compared to strong baselines, without requiring retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes NP-LoRA, a training-free geometric method for fusing independently trained subject and style LoRAs. It interprets the LoRAs as occupying overlapping low-rank subspaces and defines a projection operator that projects the content LoRA onto the orthogonal complement of the top singular vectors of the style LoRA matrix, thereby suppressing interference along dominant style directions. A soft-projection variant is derived as the closed-form solution to a regularized optimization problem controlled by a single scalar that interpolates between linear merging and hard projection. The central claim is that this yields more balanced content-style composition than existing weight-merging baselines across multiple pretrained LoRA pairs.

Significance. If the subspace separation assumption holds and the reported improvements are reproducible, the approach would supply a lightweight, parameter-efficient operator for controllable generation that avoids retraining or additional fine-tuning. The closed-form soft-projection solution and the explicit modeling of non-orthogonality are technically clean contributions that could be adopted in other low-rank adaptation settings.

major comments (3)
  1. [Abstract] Abstract and experimental claims: the assertion that NP-LoRA 'achieves more balanced content-style composition compared to strong baselines' is presented without any quantitative metrics, error bars, dataset sizes, or subject-consistency scores. This absence is load-bearing because the central claim rests entirely on an unverified experimental assertion rather than on verifiable numbers.
  2. [Method] Method section (soft-projection formulation): the single tunable scalar that controls the regularized projection is a free parameter; the manuscript does not state whether its value is chosen by cross-validation on the same evaluation set used to report results. If so, this introduces a circularity that undermines the claim of training-free superiority.
  3. [Method] Geometric construction: the claim that the top singular vectors of the style LoRA isolate the interference subspace while leaving subject-specific content directions largely intact is not accompanied by any diagnostic (e.g., cosine overlap between content and style singular vectors or rank preservation after projection). Without such a check the weakest assumption remains untested and the null-space guarantee is not established.
minor comments (2)
  1. [Method] Notation for the projection operator and the regularization parameter should be introduced with explicit definitions and ranges before the closed-form derivation is presented.
  2. [Abstract] The abstract states 'extensive experiments across multiple pretrained LoRA pairs' but supplies no table or figure reference; a results table with per-pair metrics would improve clarity.
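The diagnostic requested in major comment 3 could look roughly like the sketch below. It reports the cosines of the principal angles between the content and style row subspaces, plus the fraction of content energy that falls inside the style subspace. The function names, the choice of right singular vectors, and the toy data are assumptions; the paper does not specify such a check.

```python
import numpy as np

def right_basis(delta_w, k):
    """Top-k right singular vectors of an update, as columns."""
    _, _, vt = np.linalg.svd(delta_w, full_matrices=False)
    return vt[:k].T

def subspace_overlap(delta_w_content, delta_w_style, k):
    """Return (principal-angle cosines, content energy inside the style subspace)."""
    v_c = right_basis(delta_w_content, k)
    v_s = right_basis(delta_w_style, k)
    # Singular values of V_c^T V_s are the cosines of the principal angles.
    cosines = np.linalg.svd(v_c.T @ v_s, compute_uv=False)
    energy_in_style = (
        np.linalg.norm(delta_w_content @ v_s) ** 2
        / np.linalg.norm(delta_w_content) ** 2
    )
    return cosines, float(energy_in_style)

# Toy rank-4 updates standing in for real LoRA products (hypothetical data).
rng = np.random.default_rng(3)
d_out, d_in, rank = 32, 24, 4
delta_w_style = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))
delta_w_content = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))

cosines, energy = subspace_overlap(delta_w_content, delta_w_style, k=rank)
print("principal-angle cosines:", np.round(cosines, 3))
print(f"content energy inside style subspace: {energy:.3f}")
```

Cosines near 1 would indicate strongly shared directions, where hard projection removes the most content. Cosines near 0 would mean the two updates are already close to orthogonal and projection changes little.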

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below, with clarifications on the abstract, parameter selection, and geometric assumptions, and indicate planned revisions.

  1. Referee: [Abstract] Abstract and experimental claims: the assertion that NP-LoRA 'achieves more balanced content-style composition compared to strong baselines' is presented without any quantitative metrics, error bars, dataset sizes, or subject-consistency scores. This absence is load-bearing because the central claim rests entirely on an unverified experimental assertion rather than on verifiable numbers.

    Authors: The abstract is intended as a concise overview; the full manuscript reports quantitative results including subject-consistency scores, style fidelity metrics, and comparisons over multiple LoRA pairs with dataset details. To address the concern directly, we will revise the abstract to include key quantitative highlights such as average improvements and evaluation scale. revision: yes

  2. Referee: [Method] Method section (soft-projection formulation): the single tunable scalar that controls the regularized projection is a free parameter; the manuscript does not state whether its value is chosen by cross-validation on the same evaluation set used to report results. If so, this introduces a circularity that undermines the claim of training-free superiority.

    Authors: The scalar is chosen via limited visual inspection on a small held-out set disjoint from the reported evaluation data, consistent with the training-free fusion claim. We will explicitly document this procedure in the revised method section to remove ambiguity. revision: yes

  3. Referee: [Method] Geometric construction: the claim that the top singular vectors of the style LoRA isolate the interference subspace while leaving subject-specific content directions largely intact is not accompanied by any diagnostic (e.g., cosine overlap between content and style singular vectors or rank preservation after projection). Without such a check the weakest assumption remains untested and the null-space guarantee is not established.

    Authors: We agree that direct diagnostics would strengthen the geometric claims. We will add cosine-similarity analysis between content and style singular vectors together with post-projection rank statistics in a new appendix or results subsection. revision: yes
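The post-projection rank statistics mentioned in this response might be gathered along the following lines. This is a sketch under assumptions: hard projection against the top-k right singular vectors of the style update, an absolute singular-value tolerance for rank, and hypothetical names and toy data throughout.

```python
import numpy as np

def projection_stats(delta_w_content, delta_w_style, k, tol=1e-8):
    """Rank before/after hard projection and the fraction of content energy kept."""
    _, _, vt = np.linalg.svd(delta_w_style, full_matrices=False)
    v_k = vt[:k].T
    projected = delta_w_content - (delta_w_content @ v_k) @ v_k.T
    return {
        "rank_before": int(np.linalg.matrix_rank(delta_w_content, tol=tol)),
        "rank_after": int(np.linalg.matrix_rank(projected, tol=tol)),
        "retained_energy": float(
            np.linalg.norm(projected) ** 2 / np.linalg.norm(delta_w_content) ** 2
        ),
    }

# Toy rank-4 updates standing in for real LoRA products (hypothetical data).
rng = np.random.default_rng(4)
d_out, d_in, rank = 32, 24, 4
delta_w_style = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))
delta_w_content = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))

stats = projection_stats(delta_w_content, delta_w_style, k=rank)
print(stats)

# Degenerate case: content identical to style, so projection removes everything.
collapsed = projection_stats(delta_w_style, delta_w_style, k=rank)
print(collapsed)
```

A drop in rank or a small retained-energy value would flag LoRA pairs where projection discards a large share of the content update.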

Circularity Check

0 steps flagged

No significant circularity detected


The paper's derivation introduces a geometric view of LoRA subspaces and defines NP-LoRA via principal directions of the style LoRA to construct a projection operator (hard and soft variants with closed-form solution). This construction is independent of the reported experimental outcomes; the performance claims rest on external evaluation across multiple pretrained LoRA pairs rather than any fitted parameter or self-referential definition. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps. The single tunable parameter in soft projection is presented as part of the method definition, not as a post-hoc fit renamed as prediction. The derivation chain is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the geometric modeling of LoRA subspaces as overlapping and non-orthogonal, plus the modeling choice that principal directions of the style LoRA define the interference subspace. One free parameter controls the soft-projection strength.

free parameters (1)
  • soft-projection regularization parameter
    Single scalar that interpolates between linear merging and hard projection; its value is chosen to balance content preservation and style-subspace suppression.
axioms (1)
  • domain assumption: Content and style LoRAs occupy overlapping, non-orthogonal low-rank subspaces whose overlap produces conflicting parameter updates.
    Invoked in the opening geometric perspective paragraph of the abstract.

pith-pipeline@v0.9.0 · 5810 in / 1232 out tokens · 26101 ms · 2026-05-25T07:56:00.276946+00:00 · methodology

