Recognition: no theorem link
MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation
Pith reviewed 2026-05-13 05:24 UTC · model grok-4.3
The pith
MULTI disentangles camera lens, sensor, view, and domain factors using two-stage textual inversion to create novel image combinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By learning general imaging factors in the first stage of textual inversion and dataset-specific factors in the second, MULTI disentangles camera lens, sensor type, viewpoint, and domain characteristics. This setup allows generating images with previously unseen factor combinations, extending existing datasets, and reducing the distribution gap between real and synthetic images.
What carries the argument
MULTI, a two-stage textual inversion process that isolates general factors first and then dataset-specific ones to achieve multi-factor disentanglement.
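The abstract gives no equations for the two stages, so the following is only a toy sketch of the sequential-freezing idea, not the paper's implementation: pseudo-tokens are reduced to plain vectors fitted by gradient descent on feature means, and the additive data model, dataset names, and variable names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data model (an assumption, not the paper's): each "image" is a feature
# vector composed of a general imaging factor, a dataset-specific factor, and noise.
general = rng.normal(size=8)
specific = {"A": rng.normal(size=8), "B": rng.normal(size=8)}

def sample(dataset, n=200):
    return general + specific[dataset] + 0.01 * rng.normal(size=(n, 8))

# Stage 1: fit a "general" pseudo-token on images pooled across datasets
# (here by gradient descent on a squared-error reconstruction loss).
pooled = np.concatenate([sample("A"), sample("B")])
tok_general = np.zeros(8)
for _ in range(500):
    tok_general -= 0.1 * 2 * (tok_general - pooled).mean(axis=0)

# Stage 2: freeze tok_general and fit a separate token per dataset on what remains.
tok_specific = {}
for name in ("A", "B"):
    residual = sample(name) - tok_general  # the stage-1 token is frozen here
    tok = np.zeros(8)
    for _ in range(500):
        tok -= 0.1 * 2 * (tok - residual).mean(axis=0)
    tok_specific[name] = tok

# Recombining tok_general + tok_specific[name] recovers each dataset's statistics.
# Caveat visible even in this toy: the stage-1 token converges to
# general + mean(specific), so dataset-specific information leaks into stage 1
# whenever the pooled data does not average it out.
```

Even this trivial version shows why the sequential schedule alone does not guarantee isolation; it only guarantees that stage 2 explains what stage 1 left over.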
If this is right
- This setup enables the extension of existing datasets through novel factor combinations.
- Distribution gaps between real and generated images are reduced.
- Specific factors can be modified independently while supporting image-to-image generation with ControlNets.
- The effectiveness is shown through evaluation on the DF-RICO benchmark.
Where Pith is reading between the lines
- If the isolation holds, imaging conditions could be composed modularly in synthesis pipelines much like separate controls for content and style.
- The two-stage pattern might extend to other attributes such as lighting conditions or material properties in later models.
- Practical use would need checks that disentanglement remains stable when applied to camera models and scenes far outside the original training sets.
Load-bearing premise
The two-stage textual inversion isolates general and dataset-specific imaging factors without leakage or mixing between them.
What would settle it
Generate images with novel combinations of factors absent from training and verify through metrics or human evaluation whether each intended factor such as lens type or domain can be identified independently without interference from the others.
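One checkable version of "without interference" is a parallelogram test: swapping one factor should shift the representation by the same vector regardless of the other factors. A minimal sketch under an assumed additive composition (the token names, dimensions, and `compose` function are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32
# Hypothetical per-factor tokens; in MULTI these would be learned pseudo-token embeddings.
lens = {"pinhole": rng.normal(size=d), "fisheye": rng.normal(size=d)}
domain = {"rgb": rng.normal(size=d), "thermal": rng.normal(size=d)}

def compose(lens_name, domain_name):
    """Assumed additive composition of factor tokens."""
    return lens[lens_name] + domain[domain_name]

# Parallelogram test: the lens edit must be the same displacement under either domain.
shift_rgb = compose("fisheye", "rgb") - compose("pinhole", "rgb")
shift_thermal = compose("fisheye", "thermal") - compose("pinhole", "thermal")
interference = np.linalg.norm(shift_rgb - shift_thermal)  # 0 under perfect disentanglement
```

On real generated images the same test could be run on perceptual features (e.g. CLIP or DINOv3 embeddings, both already cited by the paper), with a tolerance in place of exact equality.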
Original abstract
Recent text-to-image models produce high-quality images, yet text ambiguity hinders precise control when specific styles or objects are required. There have been a number of recent works dealing with learning and composing multiple objects and patterns. However, current work focuses almost entirely on image content, overlooking imaging factors such as camera lens, sensor types, imaging viewpoints, and scenes' domain characteristics. We introduce this new challenge as Imaging Factor Disentanglement and show limitations of current approaches in the regime. We, therefore, propose the new method Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables the extension of existing datasets and novel factor combinations, thereby reducing distribution gaps. It further supports modifications of specific factors and image-to-image generation via ControlNets. The evaluation on our new DF-RICO benchmark demonstrates the effectiveness of MULTI and highlights the importance of Factor Disentanglement as a new direction of research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Imaging Factor Disentanglement as a new challenge for text-to-image models, noting that existing work overlooks factors such as camera lens, sensor type, viewpoint, and domain. It proposes MULTI, a two-stage textual inversion method in which the first stage learns general factors and the second extracts dataset-specific ones. This is claimed to enable dataset extension, novel factor combinations, distribution-gap reduction, and image-to-image generation via ControlNets. Effectiveness is demonstrated on the new DF-RICO benchmark.
Significance. If the two-stage process can reliably isolate imaging factors, the work would open a useful direction for fine-grained control in generative models that goes beyond object and style composition, with potential benefits for dataset augmentation and generalization.
major comments (1)
- [Method (two-stage textual inversion)] The central claim of disentanglement requires that stage-1 embeddings capture only generic imaging factors (lens, sensor, viewpoint) while stage-2 embeddings capture only dataset-specific residuals, with no cross-contamination. The method description performs sequential optimization of separate pseudo-tokens without orthogonality loss, mutual-information penalty, or cycle-consistency constraint between the two embedding sets. When training images contain correlated factors, the optimization can distribute information across both stages, violating the isolation needed for novel factor recombination and distribution-gap reduction.
minor comments (2)
- The manuscript supplies no implementation details, quantitative metrics, ablation studies, or error analysis, making it impossible to verify whether the described stages actually support the disentanglement claims.
- No equations or formal derivations appear to define the pseudo-token optimization or the separation between general and dataset-specific factors.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment on the two-stage textual inversion method below and will revise the paper to strengthen the presentation of disentanglement.
Point-by-point responses
- Referee: [Method (two-stage textual inversion)] The central claim of disentanglement requires that stage-1 embeddings capture only generic imaging factors (lens, sensor, viewpoint) while stage-2 embeddings capture only dataset-specific residuals, with no cross-contamination. The method description performs sequential optimization of separate pseudo-tokens without orthogonality loss, mutual-information penalty, or cycle-consistency constraint between the two embedding sets. When training images contain correlated factors, the optimization can distribute information across both stages, violating the isolation needed for novel factor recombination and distribution-gap reduction.
  Authors: We appreciate the referee highlighting this key requirement for reliable disentanglement. In MULTI, stage 1 optimizes a set of pseudo-tokens on a broad collection of images drawn from multiple datasets to capture generic imaging factors (lens, sensor, viewpoint), while stage 2 optimizes a separate set of pseudo-tokens on the target dataset with stage-1 tokens frozen, allowing them to encode only the residual dataset-specific variations. The sequential nature and data selection are intended to encourage separation without explicit cross terms. Our experiments on DF-RICO, including novel factor recombination and distribution-gap reduction, provide empirical support for this isolation. That said, we agree that additional regularization would further guard against leakage when factors are correlated. In the revision we will add an orthogonality loss between the two embedding sets, report mutual-information estimates between stages, and include ablation studies on factor swapping to quantify the degree of disentanglement.
  Revision: yes
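The mutual-information estimates promised in the rebuttal could be computed, for scalar projections of the two token sets, with a simple histogram estimator. A sketch, with the function name and bin count being our choices (note that histogram MI is biased upward for small samples):

```python
import numpy as np

def mi_histogram(x, y, bins=16):
    """Histogram estimate of the mutual information I(X; Y) in nats.

    x, y: 1-D samples, e.g. projections of stage-1 and stage-2 embeddings
    onto their leading principal components. Near 0 for independent samples,
    large when one sample determines the other.
    """
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                 # joint distribution over bins
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # empty cells contribute 0 to the sum
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

Reporting such estimates before and after adding a decorrelation term would quantify how much information the two stages actually share.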
Circularity Check
No circularity: new method construction with no derived quantities or self-referential definitions
Full rationale
The paper proposes MULTI as a two-stage textual inversion procedure for imaging factor disentanglement. No equations, derivations, or quantitative predictions appear in the abstract or description. The central claims concern the empirical behavior of this new architecture on the introduced DF-RICO benchmark and its ability to enable novel factor combinations. These are presented as properties of the proposed construction rather than results obtained by fitting parameters to a subset of data and then predicting closely related quantities, or by any self-citation chain. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced. The method is therefore self-contained as an independent proposal.
Axiom & Free-Parameter Ledger
invented entities (2)
- MULTI method: no independent evidence
- DF-RICO benchmark: no independent evidence
Reference graph
Works this paper leans on
- [1] Agnolucci, L., Baldrati, A., Del Bimbo, A., Bertini, M.: iSEARLE: Improving textual inversion for zero-shot composed image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
- [2] Avrahami, O., Aberman, K., Fried, O., Cohen-Or, D., Lischinski, D.: Break-A-Scene: Extracting multiple concepts from a single image. In: SIGGRAPH Asia. pp. 1–12 (2023)
- [3] Baisa, N.L., Pallam, B., Jayavel, A.: CLIP-HandID: Vision-language model for hand-based person identification. arXiv preprint arXiv:2506.12447 (2025)
- [4] Bijelic, M., Gruber, T., Mannan, F., Kraus, F., Ritter, W., Dietmayer, K., Heide, F.: Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather. In: CVPR. pp. 11679–11689 (2020)
- [5] Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning, vol. 4. Springer (2006)
- [6]
- [7] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027 (2020)
- [8] Chen, Z., Qian, Y., Yang, X., Wang, C., Yang, M.: AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection. arXiv preprint arXiv:2405.12944 (2024)
- [9] Dong, Z., Wei, P., Lin, L.: DreamArtist: Controllable one-shot text-to-image generation via positive-negative adapter. International Journal of Computer Vision 133(10), 7037–7053 (2025)
- [10]
- [11] FLIR, T.: FREE Teledyne FLIR Thermal Dataset for Algorithm Training. https://www.flir.com/oem/adas/adas-dataset-form/ (Accessed: 01122024)
- [12] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. In: ICLR (2022)
- [13] Garibi, D., Yadin, S., Paiss, R., Tov, O., Zada, S., Ephrat, A., Michaeli, T., Mosseri, I., Dekel, T.: TokenVerse: Versatile multi-concept personalization in token modulation space. ACM Transactions on Graphics (TOG) 44(4), 1–11 (2025)
- [14] Gehrig, D., Scaramuzza, D.: Low-latency automotive vision with event cameras. Nature 629(8014), 1034–1040 (2024)
- [15] Gehrig, M., Aarents, W., Gehrig, D., Scaramuzza, D.: DSEC: A Stereo Event Camera Dataset for Driving Scenarios. IEEE Robotics and Automation Letters 6(3), 4947–4954 (2021)
- [16] Gochoo, M., Otgonbold, M.E., Ganbold, E., Hsieh, J.W., Chang, M.C., Chen, P.Y., Dorj, B., Al Jassmi, H., Batnasan, G., Alnajjar, F., Abduljabbar, M., Lin, F.P.: FishEye8K: A Benchmark and Dataset for Fisheye Camera Object Detection. In: CVPR Workshops. pp. 5305–5313 (2023)
- [17] Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep Learning, vol. 1. MIT Press (2016)
- [18] Ha, D., Dai, A., Le, Q.V.: HyperNetworks. arXiv preprint arXiv:1609.09106 (2016)
- [19] Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPScore: A reference-free evaluation metric for image captioning. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 7514–7528 (2021)
- [20] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
- [21] Hotelling, H.: Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24(6), 417 (1933)
- [22] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)
- [23] Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S.N., Rosaen, K., Vasudevan, R.: Driving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks? arXiv preprint arXiv:1610.01983 (2017)
- [24] Kansy, M., Naruniec, J., Schroers, C., Gross, M., Weber, R.M.: Reenact Anything: Semantic video motion transfer using motion-textual inversion. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers. pp. 1–12 (2025)
- [25]
- [26] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2017)
- [27] Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
- [28]
- [29]
- [30]
- [31] Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572 (1901)
- [32] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. In: ICLR (2023)
- [33]
- [34]
- [35] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
- [36] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. Advances in Neural Information Processing Systems 29 (2016)
- [37] Schneider, P., Anisimov, Y., Islam, R., Mirbach, B., Rambach, J., Stricker, D., Grandidier, F.: TIMo—A Dataset for Indoor Building Monitoring with a Time-of-Flight Camera. Sensors 22(11), 3992 (2022)
- [38] Shentu, J., Watson, M., Al Moubayed, N.: AttenCraft: Attention-guided disentanglement of multiple concepts for text-to-image customization. CoRR (2024)
- [39] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)
- [40] Sohn, K., Ruiz, N., Lee, K., Chin, D.C., Blok, I., Chang, H., Barber, J., Jiang, L., Entis, G., Li, Y., et al.: StyleDrop: Text-to-image generation in any style. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. pp. 66860–66889 (2023)
- [41] Sun, T., Segu, M., Postels, J., Wang, Y., Van Gool, L., Schiele, B., Tombari, F., Yu, F.: SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation. In: CVPR. pp. 21339–21350 (2022)
- [42] Vinker, Y., Voynov, A., Cohen-Or, D., Shamir, A.: Concept decomposition for visual exploration and inspiration. ACM Transactions on Graphics (TOG) 42(6), 1–13 (2023)
- [43]
- [44]
- [45] Wei, Y., Zheng, Y., Zhang, Y., Liu, M., Ji, Z., Zhang, L., Zuo, W.: Personalized image generation with deep generative models: A decade survey. arXiv preprint arXiv:2502.13081 (2025)
- [46] de Wilde, B., Saha, A., de Rooij, M., Huisman, H., Litjens, G.: Medical diffusion on a budget: Textual inversion for medical image generation. In: Medical Imaging with Deep Learning. pp. 1687–1706. PMLR (2024)
- [47] Wrenninge, M., Unger, J.: Synscapes: A Photorealistic Synthetic Dataset for Street Scene Parsing. arXiv preprint arXiv:1810.08705 (2018)
- [48] Xu, C., Xu, Y., Zhang, H., Xu, X., He, S.: DreamAnime: Learning style-identity textual disentanglement for anime and beyond. IEEE Transactions on Visualization and Computer Graphics (2024)
- [49]
- [50] Yang, L., Li, L., Xin, X., Sun, Y., Song, Q., Wang, W.: Large-Scale Person Detection and Localization using Overhead Fisheye Cameras. arXiv preprint arXiv:2307.08252 (2023)
- [51]
- [52] Yogamani, S., Hughes, C., Horgan, J., Sistu, G., Varley, P., O’Dea, D., Uricar, M., Milz, S., Simon, M., Amende, K., Witt, C., Rashed, H., Chennupati, S., Nayak, S., Mansoor, S., Perroton, X., Perez, P.: WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. arXiv preprint arXiv:1905.01489 (2021)
- [53] Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. arXiv preprint arXiv:1805.04687 (2020)
- [54]
- [55] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 586–595 (2018)
- [56]
- [57] Zhong, W., Yang, H., Liu, Z., He, H., He, Z., Niu, X., Zhang, D., Li, G.: Mod-Adapter: Tuning-free and versatile multi-concept personalization via modulation adapter. arXiv preprint arXiv:2505.18612 (2025)
- [58] Zhu, P., Wen, L., Du, D., Bian, X., Fan, H., Hu, Q., Ling, H.: Detection and Tracking Meet Drones Challenge. PAMI 44(11), 7380–7399 (2022)