NP-LoRA: Null Space Projection for Subject-Style LoRA Fusion

Chuheng Chen; Geyuan Zhang; Xiaofei Zhou; Yong Huang

arxiv: 2511.11051 · v3 · pith:OHWMMWVSnew · submitted 2025-11-14 · 💻 cs.CV

Pith reviewed 2026-05-25 07:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords LoRA fusion · null space projection · subject style composition · diffusion model adaptation · training-free merging · parameter subspace interference · low-rank adapter composition

The pith

Projecting the content LoRA onto the null space of the style LoRA's principal directions suppresses interference during fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a geometric reformulation of LoRA fusion as the control of overlapping low-rank subspaces rather than simple weight averaging. It introduces NP-LoRA, which projects the content LoRA onto the null space defined by the principal directions of the style LoRA to reduce conflicting updates. A soft version of this projection is derived as the closed-form solution to a regularized optimization that trades off suppression against content preservation. Experiments on multiple pretrained LoRA pairs demonstrate improved balance in subject-style image generation without any additional training. The approach treats fusion as an explicit modulation of cross-subspace interactions instead of post-hoc merging.

Core claim

NP-LoRA defines a projection operator that maps the content LoRA into the orthogonal complement of the dominant directions of the style LoRA, thereby attenuating parameter conflicts along those directions while retaining complementary content information; the soft variant interpolates continuously between ordinary linear merging and strict null-space projection via a single regularization parameter.

What carries the argument

A null-space projection operator that maps the content LoRA matrix onto the orthogonal complement of the principal subspace of the style LoRA.
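
A minimal NumPy sketch of such a hard projection, under stated assumptions: each LoRA update is treated as a dense matrix of shape (d_out, d_in); the principal directions are taken to be the top-k right singular vectors of the style update (the parameter-space V choice that the Figure 8 caption contrasts with U); and the fused update is assumed to be the projected content plus the unmodified style. The function name and the cutoff `k` are illustrative and do not come from the paper.

```python
import numpy as np


def hard_null_space_projection(delta_content, delta_style, k):
    """Remove from delta_content its components along the top-k right
    singular vectors of delta_style (illustrative helper, assumed name)."""
    # Thin SVD of the style update: delta_style = U @ diag(s) @ vt
    _, _, vt = np.linalg.svd(delta_style, full_matrices=False)
    v_k = vt[:k].T  # (d_in, k), orthonormal columns
    # Right-multiply by (I - V_k V_k^T) without forming the d_in x d_in matrix
    return delta_content - (delta_content @ v_k) @ v_k.T


# Toy rank-4 updates standing in for real LoRA products B @ A (assumption)
rng = np.random.default_rng(0)
d_out, d_in, rank = 32, 48, 4
delta_style = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))
delta_content = rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))

projected = hard_null_space_projection(delta_content, delta_style, k=rank)
fused = projected + delta_style  # assumed fusion rule: projected content + style

# With k equal to the style rank, the projected content should have
# (numerically) no overlap with the row space of the style update.
print(np.abs(projected @ delta_style.T).max())
```

Applying the thin product `(delta_content @ v_k) @ v_k.T` keeps the cost linear in `d_in`, which matters because real adapter layers are far wider than this toy example.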

If this is right

Fusion becomes a controllable geometric operation rather than an empirical averaging step.
A single scalar parameter governs the strength of style-subspace suppression.
The method requires no retraining or additional data for each new LoRA pair.
Content information outside the style principal subspace is preserved by construction.
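
The single-scalar behaviour listed above can be sketched as follows. This is not the paper's exact formulation: the objective `||X - C||_F^2 + mu * ||X V_k||_F^2`, the symbol `mu`, and the function name are assumptions chosen so that `mu = 0` leaves the content update unchanged and large `mu` approaches the hard projection. Under that assumed objective the minimizer only shrinks the style-direction component of the content update by the factor `mu / (1 + mu)`.

```python
import numpy as np


def soft_null_space_projection(delta_content, delta_style, k, mu):
    """Closed-form minimizer of ||X - C||_F^2 + mu * ||X V_k||_F^2,
    an assumed objective; mu must be finite and non-negative."""
    _, _, vt = np.linalg.svd(delta_style, full_matrices=False)
    v_k = vt[:k].T  # top-k right singular vectors of the style update
    shrink = mu / (1.0 + mu)  # 0 at mu = 0, approaches 1 as mu grows
    return delta_content - shrink * (delta_content @ v_k) @ v_k.T


# Toy rank-3 updates standing in for real LoRA products (assumption)
rng = np.random.default_rng(1)
delta_style = rng.standard_normal((16, 3)) @ rng.standard_normal((3, 24))
delta_content = rng.standard_normal((16, 3)) @ rng.standard_normal((3, 24))

_, _, vt = np.linalg.svd(delta_style, full_matrices=False)
v_k = vt[:3].T
for mu in (0.0, 1.0, 1e6):
    x = soft_null_space_projection(delta_content, delta_style, k=3, mu=mu)
    # Fraction of the original style-direction component still present
    remaining = np.linalg.norm(x @ v_k) / np.linalg.norm(delta_content @ v_k)
    print(f"mu={mu:g}  remaining style-direction fraction={remaining:.6f}")
```

Because the columns of `v_k` are orthonormal, the remaining fraction equals `1 / (1 + mu)` in this sketch, so the knob is monotone and content outside the style subspace is never touched.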

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same projection logic could be applied in reverse, projecting style onto the null space of content, to test symmetry of the interference.
If the subspaces of multiple styles are known, successive projections might allow controlled multi-style composition.
The closed-form solution suggests the method could extend to other low-rank adapters beyond diffusion models.

Load-bearing premise

The principal directions extracted from the style LoRA capture the main directions of interference with content updates.

What would settle it

Generate images from the same subject-style LoRA pair using the hard projection, the soft projection at multiple regularization values, and standard merging, then measure whether subject fidelity drops sharply when the projection strength increases.
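
The decisive test needs image generation and subject-fidelity metrics, which cannot be reproduced here. As a cheap weight-space pre-check, the sketch below sweeps a projection strength and reports two proxies: how much of the content update's norm survives, and how much of its style-direction component remains. The strength `t` in [0, 1] (0 for plain merging, 1 for hard projection), the function name, and both proxies are assumptions for illustration, not the paper's evaluation protocol.

```python
import numpy as np


def sweep_projection_strength(delta_content, delta_style, k, strengths):
    """Return (strength, retained_content, residual_overlap) rows.
    Assumes delta_content has a nonzero component along the style directions."""
    _, _, vt = np.linalg.svd(delta_style, full_matrices=False)
    v_k = vt[:k].T
    along_style = (delta_content @ v_k) @ v_k.T  # part the projection removes
    rows = []
    for t in strengths:
        x = delta_content - t * along_style
        retained = np.linalg.norm(x) / np.linalg.norm(delta_content)
        overlap = np.linalg.norm(x @ v_k) / np.linalg.norm(delta_content @ v_k)
        rows.append((float(t), float(retained), float(overlap)))
    return rows


# Toy rank-4 updates standing in for real LoRA products (assumption)
rng = np.random.default_rng(4)
delta_style = rng.standard_normal((24, 4)) @ rng.standard_normal((4, 32))
delta_content = rng.standard_normal((24, 4)) @ rng.standard_normal((4, 32))

rows = sweep_projection_strength(delta_content, delta_style, 4, np.linspace(0.0, 1.0, 5))
for t, retained, overlap in rows:
    print(f"t={t:.2f}  retained content norm={retained:.3f}  residual overlap={overlap:.3f}")
```

A steep fall in the retained norm as `t` approaches 1 would indicate that much of the content update lives in the style subspace, which is the situation in which a drop in subject fidelity would be most likely.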

Figures

Figures reproduced from arXiv: 2511.11051 by Chuheng Chen, Geyuan Zhang, Xiaofei Zhou, Yong Huang.

**Figure 2.** Overview of the proposed method. NP-LoRA takes pretrained content and style LoRAs as inputs. The style LoRA is decomposed via singular value …

**Figure 3.** Singular value spectrum of a LoRA and perturbation effects. We …

**Figure 4.** (a) and (e) are the content and style references, respectively. (c) shows …

**Figure 5.** Visualization of our method. (a) Content image. (b) Style image. …

**Figure 6.** Qualitative comparison with Direct Weight Merge, B-LoRA, ZipLoRA, K-LoRA, LoRA.rar, and our proposed NP-LoRA, illustrating the trade-off …

**Figure 7.** Comparison of LoRA Projection. (a) Content LoRA training images, …

**Figure 8.** Comparison of output-space (U) and parameter-space (V) projections for null-space construction. The U-space projection fails to remove style …

**Figure 9.** Qualitative comparison with joint training. Joint training exhibits unstable performance and often fails to merge content and style effectively, while …

**Figure 10.** Our method effectively modifies the object’s actions and environment while maintaining the original style.

**Figure 11.** Results obtained with randomly selected seeds demonstrate the stability and robustness of our NP-LoRA.

**Figure 12.** Qualitative results of NP-LoRA on the Flux backbone using diverse publicly available LoRAs. Each image corresponds to the combination of the …

**Figure 13.** Qualitative results of NP-LoRA on the Flux backbone using diverse publicly available LoRAs. Each image corresponds to the combination of the …

Original abstract

Low-Rank Adaptation (LoRA) fusion enables the composition of subject and style representations for controllable generation without retraining. However, existing approaches primarily operate through weight-level merging, without explicitly modeling how independently trained LoRAs interact in the shared parameter space. We adopt a geometric perspective on LoRA fusion, interpreting content and style LoRAs as occupying overlapping, non-orthogonal low-rank subspaces, where such overlap can lead to conflicting parameter updates that affect generation quality. This observation motivates us to reformulate LoRA fusion not merely as parameter combination, but as a problem of controlling how updates from overlapping subspaces are combined. Based on this insight, we propose Null Space Projection LoRA (NP-LoRA), a training-free framework that employs projection as a fusion operator to explicitly modulate cross-LoRA interactions. Specifically, NP-LoRA uses principal directions of the style LoRA to define a projection subspace and projects the content LoRA onto the complementary subspace (i.e., the null space of the style LoRA), suppressing interference along dominant style directions while preserving complementary information. To avoid the overly aggressive suppression of hard projection, we further formulate soft projection as a regularized optimization problem that balances content preservation against style-subspace suppression. This objective admits a closed-form solution, yielding a projection operator controlled by a single parameter that continuously interpolates between linear merging and hard projection. Extensive experiments across multiple pretrained LoRA pairs show that NP-LoRA achieves more balanced content-style composition compared to strong baselines, without requiring retraining.
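
One way the closed form could arise is worked out below. The abstract does not state the objective, so everything here is an assumption made for illustration: $\Delta W_c$ and $\Delta W_s$ denote the content and style updates, $V_k$ holds the top-$k$ right singular vectors of $\Delta W_s$ (so $V_k^{\top} V_k = I$), and $\mu \ge 0$ is the regularization strength.

```latex
% Assumed objective and notation; not taken from the paper.
\begin{align*}
\Delta W_c^{\star}
  &= \arg\min_{X}\; \|X - \Delta W_c\|_F^2 + \mu\,\|X V_k\|_F^2 \\
0 &= (X - \Delta W_c) + \mu\, X V_k V_k^{\top}
  \;\Longrightarrow\; X\,(I + \mu V_k V_k^{\top}) = \Delta W_c \\
\Delta W_c^{\star}
  &= \Delta W_c\,(I + \mu V_k V_k^{\top})^{-1}
   = \Delta W_c \Bigl(I - \tfrac{\mu}{1+\mu}\, V_k V_k^{\top}\Bigr)
\end{align*}
```

The last equality uses $V_k^{\top} V_k = I$ (a special case of the Woodbury identity). Setting $\mu = 0$ returns $\Delta W_c$ unchanged, which corresponds to plain linear merging once the style update is added, while $\mu \to \infty$ gives the hard projection $\Delta W_c (I - V_k V_k^{\top})$.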

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

NP-LoRA introduces a null-space projection operator for LoRA fusion with a closed-form soft variant, but the abstract gives no numbers to back the balance claim.

The paper's main move is to treat LoRA fusion as a subspace problem and project the content LoRA onto the orthogonal complement of the style LoRA's top singular vectors. It adds a soft version that solves a regularized objective in closed form, controlled by one scalar that interpolates between plain linear merging and hard projection. That operator does not appear in the cited merging papers, so the geometric framing and the closed-form step are the actual additions. Being training-free, the method is practical for anyone already holding separate subject and style adapters. The view that overlapping low-rank updates cause conflicts is a reasonable way to motivate the projection, and the soft form is meant to reduce the obvious risk of over-suppressing content.

The central assumption, however, is that the style principal directions cleanly isolate the interference while leaving subject-specific directions intact. The abstract supplies no cosine overlap numbers, no pre/post subject-consistency scores, and no dataset or metric details to check whether that separation holds for real LoRA pairs. If the subspaces overlap more than expected, the projection could quietly reduce content fidelity, and the single free parameter does not fix that. The concern that the projection may discard content-critical directions therefore applies to the method as described.

This is aimed at people already working on LoRA merging or controllable diffusion models who want a simple geometric knob. A reader could try the operator on their own pairs even without the paper's numbers. It deserves peer review because the formulation is distinct and the math is explicit; referees can then see whether the experiments actually verify the separation assumption or just assert balanced outputs.

Referee Report

3 major / 2 minor

Summary. The paper proposes NP-LoRA, a training-free geometric method for fusing independently trained subject and style LoRAs. It interprets the LoRAs as occupying overlapping low-rank subspaces and defines a projection operator that projects the content LoRA onto the orthogonal complement of the top singular vectors of the style LoRA matrix, thereby suppressing interference along dominant style directions. A soft-projection variant is derived as the closed-form solution to a regularized optimization problem controlled by a single scalar that interpolates between linear merging and hard projection. The central claim is that this yields more balanced content-style composition than existing weight-merging baselines across multiple pretrained LoRA pairs.

Significance. If the subspace separation assumption holds and the reported improvements are reproducible, the approach would supply a lightweight, parameter-efficient operator for controllable generation that avoids retraining or additional fine-tuning. The closed-form soft-projection solution and the explicit modeling of non-orthogonality are technically clean contributions that could be adopted in other low-rank adaptation settings.

major comments (3)

[Abstract] Abstract and experimental claims: the assertion that NP-LoRA 'achieves more balanced content-style composition compared to strong baselines' is presented without any quantitative metrics, error bars, dataset sizes, or subject-consistency scores. This absence is load-bearing because the central claim rests entirely on an unverified experimental assertion rather than on verifiable numbers.
[Method] Method section (soft-projection formulation): the single tunable scalar that controls the regularized projection is a free parameter; the manuscript does not state whether its value is chosen by cross-validation on the same evaluation set used to report results. If so, this introduces a circularity that undermines the claim of training-free superiority.
[Method] Geometric construction: the claim that the top singular vectors of the style LoRA isolate the interference subspace while leaving subject-specific content directions largely intact is not accompanied by any diagnostic (e.g., cosine overlap between content and style singular vectors or rank preservation after projection). Without such a check the weakest assumption remains untested and the null-space guarantee is not established.
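
The diagnostics named in the last comment can be sketched in a few lines of NumPy. This is an illustrative check, not the authors' analysis: the function names, the choice of right singular vectors as the subspaces, and the absolute rank tolerance are all assumptions. The cosines of the principal angles between the two subspaces are the singular values of `V_c.T @ V_s`; values near 1 indicate shared directions, values near 0 indicate near-orthogonality.

```python
import numpy as np


def principal_cosines(delta_content, delta_style, k_content, k_style):
    """Cosines of the principal angles between the top right-singular
    subspaces of the two updates (illustrative helper, assumed name)."""
    _, _, vt_c = np.linalg.svd(delta_content, full_matrices=False)
    _, _, vt_s = np.linalg.svd(delta_style, full_matrices=False)
    v_c = vt_c[:k_content].T
    v_s = vt_s[:k_style].T
    cosines = np.linalg.svd(v_c.T @ v_s, compute_uv=False)
    return np.clip(cosines, 0.0, 1.0)  # guard against round-off above 1


def rank_after_projection(delta_content, delta_style, k_style, tol=1e-8):
    """Numerical rank of the content update after removing its components
    along the top-k style directions. tol is an absolute threshold and
    would need rescaling for real adapters with small weights."""
    _, _, vt_s = np.linalg.svd(delta_style, full_matrices=False)
    v_s = vt_s[:k_style].T
    projected = delta_content - (delta_content @ v_s) @ v_s.T
    singular_values = np.linalg.svd(projected, compute_uv=False)
    return int(np.sum(singular_values > tol))


# Toy rank-4 updates standing in for real LoRA products (assumption)
rng = np.random.default_rng(2)
delta_style = rng.standard_normal((16, 4)) @ rng.standard_normal((4, 64))
delta_content = rng.standard_normal((16, 4)) @ rng.standard_normal((4, 64))

print("principal-angle cosines:", principal_cosines(delta_content, delta_style, 4, 4))
print("rank after projection:", rank_after_projection(delta_content, delta_style, 4))
```

Reporting these two quantities per layer and per LoRA pair would show directly how much content the projection can remove before any image is generated.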

minor comments (2)

[Method] Notation for the projection operator and the regularization parameter should be introduced with explicit definitions and ranges before the closed-form derivation is presented.
[Abstract] The abstract states 'extensive experiments across multiple pretrained LoRA pairs' but supplies no table or figure reference; a results table with per-pair metrics would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below, with clarifications on the abstract, parameter selection, and geometric assumptions, and indicate planned revisions.

Referee: [Abstract] Abstract and experimental claims: the assertion that NP-LoRA 'achieves more balanced content-style composition compared to strong baselines' is presented without any quantitative metrics, error bars, dataset sizes, or subject-consistency scores. This absence is load-bearing because the central claim rests entirely on an unverified experimental assertion rather than on verifiable numbers.

Authors: The abstract is intended as a concise overview; the full manuscript reports quantitative results including subject-consistency scores, style fidelity metrics, and comparisons over multiple LoRA pairs with dataset details. To address the concern directly, we will revise the abstract to include key quantitative highlights such as average improvements and evaluation scale.
revision: yes

Referee: [Method] Method section (soft-projection formulation): the single tunable scalar that controls the regularized projection is a free parameter; the manuscript does not state whether its value is chosen by cross-validation on the same evaluation set used to report results. If so, this introduces a circularity that undermines the claim of training-free superiority.

Authors: The scalar is chosen via limited visual inspection on a small held-out set disjoint from the reported evaluation data, consistent with the training-free fusion claim. We will explicitly document this procedure in the revised method section to remove ambiguity.
revision: yes

Referee: [Method] Geometric construction: the claim that the top singular vectors of the style LoRA isolate the interference subspace while leaving subject-specific content directions largely intact is not accompanied by any diagnostic (e.g., cosine overlap between content and style singular vectors or rank preservation after projection). Without such a check the weakest assumption remains untested and the null-space guarantee is not established.

Authors: We agree that direct diagnostics would strengthen the geometric claims. We will add cosine-similarity analysis between content and style singular vectors together with post-projection rank statistics in a new appendix or results subsection.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation introduces a geometric view of LoRA subspaces and defines NP-LoRA via principal directions of the style LoRA to construct a projection operator (hard and soft variants with closed-form solution). This construction is independent of the reported experimental outcomes; the performance claims rest on external evaluation across multiple pretrained LoRA pairs rather than any fitted parameter or self-referential definition. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps. The single tunable parameter in soft projection is presented as part of the method definition, not as a post-hoc fit renamed as prediction. The derivation chain is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the geometric modeling of LoRA subspaces as overlapping and non-orthogonal, plus the modeling choice that principal directions of the style LoRA define the interference subspace. One free parameter controls the soft-projection strength.

free parameters (1)

soft-projection regularization parameter
Single scalar that interpolates between linear merging and hard projection; its value is chosen to balance content preservation and style-subspace suppression.

axioms (1)

domain assumption: Content and style LoRAs occupy overlapping, non-orthogonal low-rank subspaces whose overlap produces conflicting parameter updates.
Invoked in the opening geometric perspective paragraph of the abstract.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 3 internal anchors

[1]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen,et al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022

work page 2022
[2]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840– 6851, 2020

work page 2020
[3]

Denoising diffusion implicit models,

J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” inInternational Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=St1giarCHLP

work page 2021
[4]

Diffusion models beat gans on image synthesis,

P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” inAdvances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 8780–

work page 2021
[5]

Available: https://proceedings.neurips.cc/paper files/ paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf

[Online]. Available: https://proceedings.neurips.cc/paper files/ paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf

work page 2021
[6]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), June 2022, pp. 10 684–10 695

work page 2022
[7]

Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,

N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22 500–22 510. 8

work page 2023
[8]

Ziplora: Any subject in any style by effectively merging loras,

V . Shah, N. Ruiz, F. Cole, E. Lu, S. Lazebnik, Y . Li, and V . Jampani, “Ziplora: Any subject in any style by effectively merging loras,” in European Conference on Computer Vision. Springer, 2024, pp. 422– 438

work page 2024
[9]

K-lora: Unlocking training-free fusion of any subject and style loras,

Z. Ouyang, Z. Li, and Q. Hou, “K-lora: Unlocking training-free fusion of any subject and style loras,” inCVPR, 2025

work page 2025
[10]

Lora.rar: Learning to merge loras via hypernetworks for subject-style conditioned image generation,

D. Shenaj, O. Bohdal, M. Ozay, P. Zanuttigh, and U. Michieli, “Lora.rar: Learning to merge loras via hypernetworks for subject-style conditioned image generation,” inProceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), October 2025

work page 2025
[11]

A neural space- time representation for text-to-image personalization,

Y . Alaluf, E. Richardson, G. Metzer, and D. Cohen-Or, “A neural space- time representation for text-to-image personalization,”ACM Transac- tions on Graphics (TOG), vol. 42, no. 6, pp. 1–10, 2023

work page 2023
[12]

p+: Ex- tended textual conditioning in text-to-image generation,

A. V oynov, Q. Chu, D. Cohen-Or, and K. Aberman, “p+: Ex- tended textual conditioning in text-to-image generation,”arXiv preprint arXiv:2303.09522, 2023

work page arXiv 2023
[13]

Inversion-based style transfer with diffusion models,

Y . Zhang, N. Huang, F. Tang, H. Huang, C. Ma, W. Dong, and C. Xu, “Inversion-based style transfer with diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 10 146–10 156

work page 2023
[14]

Multi- concept customization of text-to-image diffusion,

N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y . Zhu, “Multi- concept customization of text-to-image diffusion,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 1931–1941

work page 2023
[15]

Break-a-scene: Extracting multiple concepts from a single image,

O. Avrahami, K. Aberman, O. Fried, D. Cohen-Or, and D. Lischinski, “Break-a-scene: Extracting multiple concepts from a single image,” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–12

work page 2023
[16]

Instantbooth: Personalized text-to-image generation without test-time finetuning,

J. Shi, W. Xiong, Z. Lin, and H. J. Jung, “Instantbooth: Personalized text-to-image generation without test-time finetuning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 8543–8552

work page 2024
[17]

Fastcomposer: Tuning-free multi-subject image generation with localized attention,

G. Xiao, T. Yin, W. T. Freeman, F. Durand, and S. Han, “Fastcomposer: Tuning-free multi-subject image generation with localized attention,” International Journal of Computer Vision, vol. 133, no. 3, pp. 1175– 1194, 2025

work page 2025
[18]

Smartbrush: Text and shape guided object inpainting with diffusion model,

S. Xie, Z. Zhang, Z. Lin, T. Hinz, and K. Zhang, “Smartbrush: Text and shape guided object inpainting with diffusion model,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22 428–22 437

work page 2023
[19]

Lora+: Efficient low rank adaptation of large models,

S. Hayou, N. Ghosh, and B. Yu, “Lora+: Efficient low rank adaptation of large models,”arXiv preprint arXiv:2402.12354, 2024

work page arXiv 2024
[20]

Vera: Vector-based random matrix adaptation,

D. J. Kopiczko, T. Blankevoort, and Y . M. Asano, “Vera: Vector-based random matrix adaptation,”arXiv preprint arXiv:2310.11454, 2023

work page arXiv 2023
[21]

Melora: Mini-ensemble low-rank adapters for parameter-efficient fine-tuning,

P. Ren, C. Shi, S. Wu, M. Zhang, Z. Ren, M. de Rijke, Z. Chen, and J. Pei, “Melora: Mini-ensemble low-rank adapters for parameter-efficient fine-tuning,”arXiv preprint arXiv:2402.17263, 2024

work page arXiv 2024
[22]

LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning

L. Zhang, L. Zhang, S. Shi, X. Chu, and B. Li, “Lora-fa: Memory- efficient low-rank adaptation for large language models fine-tuning,” arXiv preprint arXiv:2308.03303, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Lora-drop: Efficient lora parameter pruning based on output evaluation,

H. Zhou, X. Lu, W. Xu, C. Zhu, T. Zhao, and M. Yang, “Lora-drop: Efficient lora parameter pruning based on output evaluation,”arXiv preprint arXiv:2402.07721, 2024

work page arXiv 2024
[24]

Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices,

B. Zi, X. Qi, L. Wang, J. Wang, K.-F. Wong, and L. Zhang, “Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices,” arXiv preprint arXiv:2309.02411, 2023

work page arXiv 2023
[25]

Duolora : Cycle-consistent and rank-disentangled content-style personalization,

A. Roy, S. Borse, S. Kadambi, D. Das, S. Mahajan, R. Garrepalli, H. Park, A. Nayak, R. Chellappa, M. Hayat, and F. Porikli, “Duolora : Cycle-consistent and rank-disentangled content-style personalization,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 15 395–15 404

work page 2025
[26]

Implicit style- content separation using b-lora,

Y . Frenkel, Y . Vinker, A. Shamir, and D. Cohen-Or, “Implicit style- content separation using b-lora,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 181–198

work page 2024
[27]

Csgo: Content-style composition in text-to-image generation,

P. Xing, H. Wang, Y . Sun, Q. Wang, X. Bai, H. Ai, R. Huang, and Z. Li, “Csgo: Content-style composition in text-to-image generation,” arXiv preprint arXiv:2408.16766, 2024

work page arXiv 2024
[28]

How to continually adapt text-to-image diffusion models for flexible customization?

J. Dong, W. Liang, H. Li, D. Zhang, M. Cao, H. Ding, S. H. Khan, and F. Shahbaz Khan, “How to continually adapt text-to-image diffusion models for flexible customization?”Advances in Neural Information Processing Systems, vol. 37, pp. 130 057–130 083, 2024

work page 2024
[29]

Mix-of-show: Decentralized low-rank adap- tation for multi-concept customization of diffusion models,

Y . Gu, X. Wang, J. Z. Wu, Y . Shi, Y . Chen, Z. Fan, W. Xiao, R. Zhao, S. Chang, W. Wu,et al., “Mix-of-show: Decentralized low-rank adap- tation for multi-concept customization of diffusion models,”Advances in Neural Information Processing Systems, vol. 36, pp. 15 890–15 902, 2023

work page 2023
[30]

Mcˆ 2: Multi-concept guidance for customized multi-concept genera- tion,

J. Jiang, Y . Zhang, K. Feng, X. Wu, W. Li, R. Pei, F. Li, and W. Zuo, “Mcˆ 2: Multi-concept guidance for customized multi-concept genera- tion,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2802–2812

work page 2025
[31]

Cones: Concept neurons in diffusion models for customized generation,

Z. Liu, R. Feng, K. Zhu, Y . Zhang, K. Zheng, Y . Liu, D. Zhao, J. Zhou, and Y . Cao, “Cones: Concept neurons in diffusion models for customized generation,”arXiv preprint arXiv:2303.05125, 2023

work page arXiv 2023
[32]

Lora-composer: Leveraging low-rank adaptation for multi-concept customization in training-free diffusion models,

Y . Yang, W. Wang, L. Peng, C. Song, Y . Chen, H. Li, X. Yang, Q. Lu, D. Cai, B. Wu,et al., “Lora-composer: Leveraging low-rank adaptation for multi-concept customization in training-free diffusion models,”arXiv preprint arXiv:2403.11627, 2024

work page arXiv 2024
[33]

Multi-lora composition for image generation,

M. Zhong, Y . Shen, S. Wang, Y . Lu, Y . Jiao, S. Ouyang, D. Yu, J. Han, and W. Chen, “Multi-lora composition for image generation,” Transactions on Machine Learning Research, vol. 2024, 2024

work page 2024
[34]

Rethinking inter-lora orthogonality in adapter merging: Insights from orthogonal monte carlo dropout,

A. Zhang, X. Ding, H. Wang, S. McDonagh, and S. Kaski, “Rethinking inter-lora orthogonality in adapter merging: Insights from orthogonal monte carlo dropout,”arXiv preprint arXiv:2510.03262, 2025

work page arXiv 2025
[35]

Subject or style: Adaptive and training- free mixture of loras,

J.-C. Zhang and Y .-J. Xiong, “Subject or style: Adaptive and training- free mixture of loras,”arXiv preprint arXiv:2508.02165, 2025

work page arXiv 2025
[36]

Model merging with svd to tie the knots,

G. Stoica, P. Ramesh, B. Ecsedi, L. Choshen, and J. Hoffman, “Model merging with svd to tie the knots,” inThe Thirteenth International Conference on Learning Representations

work page
[37]

Subzero: Composing subject, style, and action via zero-shot personalization,

S. Borse, K. Bhardwaj, M. R. K. Dastjerdi, H. Park, S. Kadambi, S. Shiv- akumar, P. Mandke, A. Nayak, H. Teague, M. Hayat,et al., “Subzero: Composing subject, style, and action via zero-shot personalization,” arXiv preprint arXiv:2502.19673, 2025

work page arXiv 2025
[38]

Zero-shot adaptation of parameter-efficient fine-tuning in diffusion models,

F. Farhadzadeh, D. Das, S. Borse, and F. Porikli, “Zero-shot adaptation of parameter-efficient fine-tuning in diffusion models,” inProceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, Eds., vol....

work page 2025
[39]

Alphaedit: Null-space constrained knowledge editing for lan- guage models,

J. Fang, H. Jiang, K. Wang, Y . Ma, J. Shi, X. Wang, X. He, and T.-S. Chua, “Alphaedit: Null-space constrained knowledge editing for lan- guage models,” inThe Thirteenth International Conference on Learning Representations

work page
[40]

Lion-lora: Rethinking lora fusion to unify controllable spatial and temporal generation for video diffusion,

Y . Zhang, C. Cao, C. Yu, and J. Zhu, “Lion-lora: Rethinking lora fusion to unify controllable spatial and temporal generation for video diffusion,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14 569–14 579

work page 2025
[41]

Lora-null: Low-rank adaptation via null space for large language models,

P. Tang, Y . Liu, D. Zhang, X. Wu, and D. Zhang, “Lora-null: Low-rank adaptation via null space for large language models,”arXiv preprint arXiv:2503.02659, 2025

work page arXiv 2025
[42]

Image style transfer using convolutional neural networks,

L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2414–2423

work page 2016
[43]

Arbitrary style transfer in real-time with adaptive instance normalization,

X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” inProceedings of the IEEE interna- tional conference on computer vision, 2017, pp. 1501–1510

work page 2017
[44]

Universal style transfer via feature transforms,

Y . Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature transforms,”Advances in neural information processing systems, vol. 30, 2017

work page 2017
[45]

Tikhonov and V

A. Tikhonov and V . Arsenin,Solutions of Ill-posed Problems, ser. Halsted Press book. Winston, 1977. [Online]. Available: https://books.google.co.jp/books?id=ECrvAAAAMAAJ

work page 1977
[46]

Woodbury and P

M. Woodbury and P. U. D. of Statistics,Inverting Modified Matrices, ser. Memorandum Report / Statistical Research Group, Princeton. Department of Statistics, Princeton University, 1950. [Online]. Available: https://books.google.co.jp/books?id= zAnzgEACAAJ

work page 1950
[47]

Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,

N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22 500–22 510

work page 2023
[48]

Styledrop: Text-to-image generation in any style,

K. Sohn, N. Ruiz, K. Lee, D. C. Chin, I. Blok, H. Chang, J. Barber, L. Jiang, G. Entis, Y . Li,et al., “Styledrop: Text-to-image generation in any style,”arXiv preprint arXiv:2306.00983, 2023

work page arXiv 2023
[49] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023.
[50] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Ed...
[51] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “DINO: DETR with improved denoising anchor boxes for end-to-end object detection,” arXiv preprint arXiv:2203.03605, 2022.
[52] D. Jiang, Y. Liu, S. Liu, J. Zhao, H. Zhang, Z. Gao, X. Zhang, J. Li, and H. Xiong, “From clip to dino: Visual encoders shout in multi-modal large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2310.08825
[53] C. Van Rijsbergen, Information Retrieval. Butterworths, 1979. [Online]. Available: https://books.google.co.jp/books?id=t-pTAAAAMAAJ
[54] N. Chinchor, “MUC-4 evaluation metrics,” in Proceedings of the 4th Conference on Message Understanding, ser. MUC4 ’92. USA: Association for Computational Linguistics, 1992, pp. 22–29. [Online]. Available: https://doi.org/10.3115/1072064.1072067
[55] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2016, pp. 17–35, proposes IDF1, a harmonic mean of ID precision and recall for tracking evaluation.
[56] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, “Zero-shot learning – the good, the bad and the ugly,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4582–4591, introduces the harmonic mean (H-score) of seen/unseen accuracies for balanced evaluation in generalized zero-shot learning. SUPPLEMENTARYMATE...
