pith. machine review for the scientific record. sign in

arxiv: 2605.12134 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.LG

Recognition: no theorem link

MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-13 05:24 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords Imaging Factor DisentanglementTextual InversionDisentanglementText-to-Image GenerationCamera LensSensor TypesViewpointsDomain Characteristics
0
0 comments X

The pith

MULTI disentangles camera lens, sensor, view, and domain factors using two-stage textual inversion to create novel image combinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Imaging Factor Disentanglement as a challenge for text-to-image models that currently control content but overlook properties such as camera lens, sensor type, viewpoint, and domain. It proposes MULTI, a method that first learns general factors across datasets and then extracts dataset-specific ones through textual inversion. This separation allows novel combinations of factors, extension of existing datasets, and reduction of distribution gaps while supporting independent modifications and image-to-image generation via ControlNets.

Core claim

By separating the learning of general imaging factors in the first stage of textual inversion from dataset-specific factors in the second stage, MULTI enables the disentanglement of camera lens, sensor types, viewpoints, and domain characteristics. This setup allows the generation of images with previously unseen factor combinations, the extension of existing datasets, and the reduction of distribution gaps between real and synthetic images.

What carries the argument

MULTI, a two-stage textual inversion process that isolates general factors first and then dataset-specific ones to achieve multi-factor disentanglement.

If this is right

  • This setup enables the extension of existing datasets through novel factor combinations.
  • Distribution gaps between real and generated images are reduced.
  • Specific factors can be modified independently while supporting image-to-image generation with ControlNets.
  • The effectiveness is shown through evaluation on the DF-RICO benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the isolation holds, imaging conditions could be composed modularly in synthesis pipelines much like separate controls for content and style.
  • The two-stage pattern might extend to other attributes such as lighting conditions or material properties in later models.
  • Practical use would need checks that disentanglement remains stable when applied to camera models and scenes far outside the original training sets.

Load-bearing premise

The two-stage textual inversion isolates general and dataset-specific imaging factors without leakage or mixing between them.

What would settle it

Generate images with novel combinations of factors absent from training and verify through metrics or human evaluation whether each intended factor such as lens type or domain can be identified independently without interference from the others.

Figures

Figures reproduced from arXiv: 2605.12134 by Danda Pani Paudel, Maarten Bieshaar, Matthias Neuwirth-Trapp, Michael Moeller, Sonali Godavarthy, Tim-Felix Faasch.

Figure 1
Figure 1. Figure 1: Overview of Factor Disentanglement. In this work, we propose the challenge of Imaging Factor Disentanglement, namely, the camera lens and sensor types with associated color grading, viewpoint, and domain from a set of sparse and unpaired datasets. the data domain. Precise control over these factors is critical for the applica￾tion of diffusion models as task-specific simulations. Although approaches like L… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the MULTI framework. We optimize factor embeddings in a two-stage process: first, obtaining general embeddings, then refining them into dataset-specific ones. A specialized batching strategy is used to enforce factor overlap within each batch. of length L and of dimension d. The objective is to learn ψ such that samples x ∼ G(ψ(f)) faithfully reflect the imaging characteristics specified by f, … view at source ↗
Figure 3
Figure 3. Figure 3: Effect of the number of learnable vectors and ControlNets on FID (left) and CLIP score (middle), and FAA (right). 1 5 10 15 Number of Vectors 60 65 70 75 80 FID Score 1 5 10 15 Number of Vectors 0.65 0.70 0.75 0.80 0.85 Average FAA k 0 1 2 3 4 [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Effect of Fraction of general and specific factors in the prompt on FID and FAA. Here, k denotes the number of general factor tokens in the prompt, while k = 4 corresponds to dataset-specific tokens. identifies n = 15 as the optimal value, achieving both low FID and high FAA. As n increases, the CLIP score decreases, likely since the factor embeddings overshadow the descriptive prompt. Larger values of n a… view at source ↗
read the original abstract

Recent text-to-image models produce high-quality images, yet text ambiguity hinders precise control when specific styles or objects are required. There have been a number of recent works dealing with learning and composing multiple objects and patterns. However, current work focuses almost entirely on image content, overlooking imaging factors such as camera lens, sensor types, imaging viewpoints, and scenes' domain characteristics. We introduce this new challenge as Imaging Factor Disentanglement and show limitations of current approaches in the regime. We, therefore, propose the new method Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables the extension of existing datasets and novel factor combinations, thereby reducing distribution gaps. It further supports modifications of specific factors and image-to-image generation via ControlNets. The evaluation on our new DF-RICO benchmark demonstrates the effectiveness of MULTI and highlights the importance of Factor Disentanglement as a new direction of research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Imaging Factor Disentanglement as a new challenge for text-to-image models, noting that existing work overlooks factors such as camera lens, sensor type, viewpoint, and domain. It proposes MULTI, a two-stage textual inversion method in which the first stage learns general factors and the second extracts dataset-specific ones. This is claimed to enable dataset extension, novel factor combinations, distribution-gap reduction, and image-to-image generation via ControlNets. Effectiveness is demonstrated on the new DF-RICO benchmark.

Significance. If the two-stage process can reliably isolate imaging factors, the work would open a useful direction for fine-grained control in generative models that goes beyond object and style composition, with potential benefits for dataset augmentation and generalization.

major comments (1)
  1. [Method (two-stage textual inversion)] The central claim of disentanglement requires that stage-1 embeddings capture only generic imaging factors (lens, sensor, viewpoint) while stage-2 embeddings capture only dataset-specific residuals, with no cross-contamination. The method description performs sequential optimization of separate pseudo-tokens without orthogonality loss, mutual-information penalty, or cycle-consistency constraint between the two embedding sets. When training images contain correlated factors, the optimization can distribute information across both stages, violating the isolation needed for novel factor recombination and distribution-gap reduction.
minor comments (2)
  1. The manuscript supplies no implementation details, quantitative metrics, ablation studies, or error analysis, making it impossible to verify whether the described stages actually support the disentanglement claims.
  2. No equations or formal derivations appear to define the pseudo-token optimization or the separation between general and dataset-specific factors.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on the two-stage textual inversion method below and will revise the paper to strengthen the presentation of disentanglement.

read point-by-point responses
  1. Referee: [Method (two-stage textual inversion)] The central claim of disentanglement requires that stage-1 embeddings capture only generic imaging factors (lens, sensor, viewpoint) while stage-2 embeddings capture only dataset-specific residuals, with no cross-contamination. The method description performs sequential optimization of separate pseudo-tokens without orthogonality loss, mutual-information penalty, or cycle-consistency constraint between the two embedding sets. When training images contain correlated factors, the optimization can distribute information across both stages, violating the isolation needed for novel factor recombination and distribution-gap reduction.

    Authors: We appreciate the referee highlighting this key requirement for reliable disentanglement. In MULTI, stage 1 optimizes a set of pseudo-tokens on a broad collection of images drawn from multiple datasets to capture generic imaging factors (lens, sensor, viewpoint), while stage 2 optimizes a separate set of pseudo-tokens on the target dataset with stage-1 tokens frozen, allowing them to encode only the residual dataset-specific variations. The sequential nature and data selection are intended to encourage separation without explicit cross terms. Our experiments on DF-RICO, including novel factor recombination and distribution-gap reduction, provide empirical support for this isolation. That said, we agree that additional regularization would further guard against leakage when factors are correlated. In the revision we will add an orthogonality loss between the two embedding sets, report mutual-information estimates between stages, and include ablation studies on factor swapping to quantify the degree of disentanglement. revision: yes

Circularity Check

0 steps flagged

No circularity: new method construction with no derived quantities or self-referential definitions

full rationale

The paper proposes MULTI as a two-stage textual inversion procedure for imaging factor disentanglement. No equations, derivations, or quantitative predictions appear in the abstract or description. The central claims concern the empirical behavior of this new architecture on the introduced DF-RICO benchmark and its ability to enable novel factor combinations. These are presented as properties of the proposed construction rather than results obtained by fitting parameters to a subset of data and then predicting closely related quantities, or by any self-citation chain. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced. The method is therefore self-contained as an independent proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Only the abstract is available, so the ledger is limited to high-level elements named in the text; no specific fitted numerical parameters or formal axioms are stated.

invented entities (2)
  • MULTI method no independent evidence
    purpose: Disentangle imaging factors via two-stage textual inversion
    Newly proposed technique described in the abstract
  • DF-RICO benchmark no independent evidence
    purpose: Evaluate effectiveness of factor disentanglement
    Newly introduced evaluation dataset mentioned in the abstract

pith-pipeline@v0.9.0 · 5511 in / 1279 out tokens · 100452 ms · 2026-05-13T05:24:21.375858+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 2 internal anchors

  1. [1]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

    Agnolucci, L., Baldrati, A., Del Bimbo, A., Bertini, M.: isearle: Improving textual inversion for zero-shot composed image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  2. [2]

    In: SIGGRAPH Asia

    Avrahami,O.,Aberman,K.,Fried,O.,Cohen-Or,D.,Lischinski,D.:Break-a-scene: Extracting multiple concepts from a single image. In: SIGGRAPH Asia. pp. 1–12 (2023)

  3. [3]

    arXiv preprint arXiv:2506.12447 (2025) 14 S

    Baisa, N.L., Pallam, B., Jayavel, A.: Clip-handid: Vision-language model for hand- based person identification. arXiv preprint arXiv:2506.12447 (2025) 14 S. Godavarthy, M. Neuwirth-Trapp et al

  4. [4]

    moco , url=

    Bijelic, M., Gruber, T., Mannan, F., Kraus, F., Ritter, W., Dietmayer, K., Heide, F.: Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather. In: CVPR. pp. 11679–11689 (Jun 2020). https://doi. org/10.1109/CVPR42600.2020.01170

  5. [5]

    Bishop, C.M., Nasrabadi, N.M.: Pattern recognition and machine learning, vol. 4. Springer (2006)

  6. [6]

    In: ECCV

    Butt, M.A., Wang, K., Vazquez-Corral, J., van de Weijer, J.: Colorpeel: Color prompt learning with diffusion models via color and shape disentanglement. In: ECCV. pp. 456–472. Springer (2024)

  7. [7]

    Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom

    Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027 (May 2020)

  8. [8]

    arXiv preprint arXiv:2405.12944 (May 2024)

    Chen, Z., Qian, Y., Yang, X., Wang, C., Yang, M.: AMFD: Distillation via Adap- tive Multimodal Fusion for Multispectral Pedestrian Detection. arXiv preprint arXiv:2405.12944 (May 2024). https://doi.org/10.48550/arXiv.2405.12944

  9. [9]

    dong et al

    Dong,Z.,Wei,P.,Lin,L.:Dreamartist:Controllableone-shottext-to-imagegenera- tion via positive-negative adapter: Z. dong et al. International Journal of Computer Vision133(10), 7037–7053 (2025)

  10. [10]

    In: ICML

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML. pp. 12606–12633. PMLR (2024)

  11. [11]

    https://www.flir.com/oem/adas/adas-dataset-form/ (Accessed: 01122024)

    FLIR, T.: FREE Teledyne FLIR Thermal Dataset for Algorithm Training. https://www.flir.com/oem/adas/adas-dataset-form/ (Accessed: 01122024)

  12. [12]

    In: ICLR (2022)

    Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-or, D.: An image is worth one word: Personalizing text-to-image genera- tion using textual inversion. In: ICLR (2022)

  13. [13]

    ACM Transactions On Graphics (TOG)44(4), 1–11 (2025)

    Garibi,D.,Yadin,S.,Paiss,R.,Tov,O.,Zada,S.,Ephrat,A.,Michaeli,T.,Mosseri, I., Dekel, T.: Tokenverse: Versatile multi-concept personalization in token modu- lation space. ACM Transactions On Graphics (TOG)44(4), 1–11 (2025)

  14. [14]

    Nature629(8014), 1034–1040 (May 2024)

    Gehrig, D., Scaramuzza, D.: Low-latency automotive vision with event cameras. Nature629(8014), 1034–1040 (May 2024). https://doi.org/10.1038/ s41586-024-07409-w

  15. [15]

    IEEE Robot

    Gehrig, M., Aarents, W., Gehrig, D., Scaramuzza, D.: DSEC: A Stereo Event Camera Dataset for Driving Scenarios. IEEE Robot. Autom. Lett.6(3), 4947–4954 (Jul 2021). https://doi.org/10.1109/LRA.2021.3068942

  16. [16]

    Implications of

    Gochoo, M., Otgonbold, M.E., Ganbold, E., Hsieh, J.W., Chang, M.C., Chen, P.Y., Dorj, B., Al Jassmi, H., Batnasan, G., Alnajjar, F., Abduljabbar, M., Lin, F.P.: FishEye8K: A Benchmark and Dataset for Fisheye Camera Object Detec- tion. In: CVPR Workshops. pp. 5305–5313 (Jun 2023). https://doi.org/10.1109/ CVPRW59228.2023.00559

  17. [17]

    Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep learning, vol. 1. MIT Press (2016)

  18. [18]

    HyperNetworks

    Ha, D., Dai, A., Le, Q.V.: Hypernetworks. arXiv preprint arXiv:1609.09106 (2016)

  19. [19]

    In: Proceedings of the 2021 conference on empirical methods in natural language processing

    Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: Clipscore: A reference- free evaluation metric for image captioning. In: Proceedings of the 2021 conference on empirical methods in natural language processing. pp. 7514–7528 (2021)

  20. [20]

    Advances in neural information processing systems30(2017)

    Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

  21. [21]

    Journal of educational psychology24(6), 417 (1933) MUL TI: Disentangling Imaging Factors 15

    Hotelling, H.: Analysis of a complex of statistical variables into principal compo- nents. Journal of educational psychology24(6), 417 (1933) MUL TI: Disentangling Imaging Factors 15

  22. [22]

    ICLR1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

  23. [23]

    https://doi.org/10.48550/arXiv.1610.01983

    Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S.N., Rosaen, K., Vasude- van, R.: Driving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks? arXiv preprint arXiv:1610.01983 (Feb 2017). https://doi.org/10.48550/arXiv.1610.01983

  24. [24]

    In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers

    Kansy, M., Naruniec, J., Schroers, C., Gross, M., Weber, R.M.: Reenact anything: Semantic video motion transfer using motion-textual inversion. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–12 (2025)

  25. [25]

    In: ICML

    Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML. pp. 12888– 12900. PMLR (2022)

  26. [26]

    In: ICLR (2017)

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2017)

  27. [27]

    Journal of machine learn- ing research9(Nov), 2579–2605 (2008)

    Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. Journal of machine learn- ing research9(Nov), 2579–2605 (2008)

  28. [28]

    In: ECCV

    Motamed, S., Paudel, D.P., Van Gool, L.: Lego: Learning to disentangle and invert personalized concepts beyond object appearance in text-to-image diffusion models. In: ECCV. pp. 116–133. Springer (2024)

  29. [29]

    In: ICCVW

    Neuwirth-Trapp, M., Bieshaar, M., Paudel, D.P., Van Gool, L.: Rico: Two realistic benchmarks and an in-depth analysis for incremental learning in object detection. In: ICCVW. pp. 5153–5164 (2025)

  30. [30]

    In: ICML

    Nichol, A.Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In: ICML. pp. 16784–16804. PMLR (2022)

  31. [31]

    on lines and planes of closest fit to systems of points in space

    Pearson, K.: Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science 2(11), 559–572 (1901)

  32. [32]

    In: ICLR (2023)

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: ICLR (2023)

  33. [33]

    In: CVPR

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)

  34. [34]

    In: CVPR

    Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR. pp. 22500–22510 (2023)

  35. [35]

    Advances in neural information processing systems35, 36479–36494 (2022)

    Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text- to-image diffusion models with deep language understanding. Advances in neural information processing systems35, 36479–36494 (2022)

  36. [36]

    Advances in neural information processing systems29(2016)

    Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. Advances in neural information processing systems29(2016)

  37. [37]

    Sensors22(11), 3992 (May 2022)

    Schneider, P., Anisimov, Y., Islam, R., Mirbach, B., Rambach, J., Stricker, D., Grandidier, F.: TIMo—A Dataset for Indoor Building Monitoring with a Time- of-Flight Camera. Sensors22(11), 3992 (May 2022). https://doi.org/10.3390/ s22113992

  38. [38]

    CoRR (2024) 16 S

    Shentu, J., Watson, M., Al Moubayed, N.: Attencraft: Attention-guided disentan- glement of multiple concepts for text-to-image customization. CoRR (2024) 16 S. Godavarthy, M. Neuwirth-Trapp et al

  39. [39]

    DINOv3

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

  40. [40]

    In: Proceedingsofthe37thInternationalConferenceonNeuralInformationProcessing Systems

    Sohn, K., Ruiz, N., Lee, K., Chin, D.C., Blok, I., Chang, H., Barber, J., Jiang, L., Entis, G., Li, Y., et al.: Styledrop: text-to-image generation in any style. In: Proceedingsofthe37thInternationalConferenceonNeuralInformationProcessing Systems. pp. 66860–66889 (2023)

  41. [41]

    MViTv2: Improved Multiscale Vision Transformers for Classification and Detection , isbn =

    Sun, T., Segu, M., Postels, J., Wang, Y., Van Gool, L., Schiele, B., Tombari, F., Yu, F.: SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Do- main Adaptation. In: CVPR. pp. 21339–21350 (Jun 2022). https://doi.org/10. 1109/CVPR52688.2022.02068

  42. [42]

    ACM Transactions on Graphics (TOG)42(6), 1–13 (2023)

    Vinker, Y., Voynov, A., Cohen-Or, D., Shamir, A.: Concept decomposition for visual exploration and inspiration. ACM Transactions on Graphics (TOG)42(6), 1–13 (2023)

  43. [43]

    In: WACV

    Wang, K., Yang, F., Raducanu, B., van de Weijer, J.: Multi-class textual-inversion secretly yields a semantic-agnostic classifier. In: WACV. pp. 4400–4409. IEEE (2025)

  44. [44]

    In: ICCV

    Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W.: Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation. In: ICCV. pp. 15943–15953 (2023)

  45. [45]

    arXiv preprint arXiv:2502.13081 (2025)

    Wei, Y., Zheng, Y., Zhang, Y., Liu, M., Ji, Z., Zhang, L., Zuo, W.: Personalized image generation with deep generative models: A decade survey. arXiv preprint arXiv:2502.13081 (2025)

  46. [46]

    In: Medical Imaging with Deep Learning

    de Wilde, B., Saha, A., de Rooij, M., Huisman, H., Litjens, G.: Medical diffusion on a budget: Textual inversion for medical image generation. In: Medical Imaging with Deep Learning. pp. 1687–1706. PMLR (2024)

  47. [47]

    Synscapes: A Photorealistic Synthetic Dataset for Street Scene Parsing

    Wrenninge, M., Unger, J.: Synscapes: A Photorealistic Synthetic Dataset for Street Scene Parsing. arXiv preprint arXiv:1810.08705 (Oct 2018). https://doi.org/10. 48550/arXiv.1810.08705

  48. [48]

    Xu, C., Xu, Y., Zhang, H., Xu, X., He, S.: Dreamanime: Learning style-identity textualdisentanglementforanimeandbeyond.IEEETransactionsonVisualization and Computer Graphics (2024)

  49. [49]

    In: WACV

    Xu, Z., Hao, S., Han, K.: Cusconcept: Customized visual concept decomposition with diffusion models. In: WACV. pp. 3678–3687. IEEE (2025)

  50. [50]

    arXiv preprint arXiv:2307.08252 (Jul 2023)

    Yang, L., Li, L., Xin, X., Sun, Y., Song, Q., Wang, W.: Large-Scale Person Detection and Localization using Overhead Fisheye Cameras. arXiv preprint arXiv:2307.08252 (Jul 2023). https://doi.org/10.48550/arXiv.2307.08252

  51. [51]

    In: CVPR

    Yang, Z., Wu, D., Wu, C., Lin, Z., Gu, J., Wang, W.: A pedestrian is worth one prompt: Towards language guidance person re-identification. In: CVPR. pp. 17343–17353 (2024)

  52. [52]

    arXiv preprint arXiv:1905.01489 (Jul 2021)

    Yogamani, S., Hughes, C., Horgan, J., Sistu, G., Varley, P., O’Dea, D., Uricar, M., Milz, S., Simon, M., Amende, K., Witt, C., Rashed, H., Chennupati, S., Nayak, S., Mansoor, S., Perroton, X., Perez, P.: WoodScape: A multi-task, multi-camera fish- eye dataset for autonomous driving. arXiv preprint arXiv:1905.01489 (Jul 2021). https://doi.org/10.48550/arXi...

  53. [53]

    arxiv preprint arXiv:1805.04687 (Apr 2020)

    Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. arxiv preprint arXiv:1805.04687 (Apr 2020)

  54. [54]

    In: ICCV

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV. pp. 3836–3847 (2023) MUL TI: Disentangling Imaging Factors 17

  55. [55]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)

  56. [56]

    In: CVPR

    Zhang, Y., Yang, M., Zhou, Q., Wang, Z.: Attention calibration for disentangled text-to-image personalization. In: CVPR. pp. 4764–4774 (2024)

  57. [57]

    Mod-adapter: Tuning-free and versatile multi-concept personalization via modulation adapter.arXiv preprint arXiv:2505.18612, 2025

    Zhong, W., Yang, H., Liu, Z., He, H., He, Z., Niu, X., Zhang, D., Li, G.: Mod- adapter: Tuning-free and versatile multi-concept personalization via modulation adapter. arXiv preprint arXiv:2505.18612 (2025)

  58. [58]

    A <source> <sensor> image from <view> captured with <lens> showing <description>

    Zhu, P., Wen, L., Du, D., Bian, X., Fan, H., Hu, Q., Ling, H.: Detection and Tracking Meet Drones Challenge. PAMI44(11), 7380–7399 (Nov 2022). https: //doi.org/10.1109/TPAMI.2021.3119563 18 S. Godavarthy, M. Neuwirth-Trapp et al. Supplementary Material S.1 DF-RICO Benchmark To study our method, we build upon RICO benchmark [29], which was initially develo...