pith. machine review for the scientific record.

arxiv: 2512.10955 · v2 · submitted 2025-12-11 · 💻 cs.CV

Recognition: 2 Lean theorem links

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords: open-vocabulary attribute encoder · visual concept personalization · attribute disentanglement · image synthesis · contrastive learning · generative models · attribute retrieval

The pith

Omni-Attribute is the first open-vocabulary encoder that learns isolated representations for single visual attributes like identity or lighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to replace general-purpose image embeddings, which mix many visual factors, with a dedicated encoder that captures only one chosen attribute at a time. It does so by building training data from image pairs that differ in exactly one labeled attribute and by training the model with two goals at once: accurate reconstruction of the changed image and contrastive separation from the unchanged one. Readers would care because this separation would let image-editing systems transfer a single trait such as expression or style into new scenes while leaving all other aspects untouched. The resulting embeddings are shown to work for retrieval, personalization, and combining several attributes in one output.
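To make the dual objective concrete, here is a minimal sketch of what one training step could look like under the reading above: a generative fidelity term computed by a frozen decoder plus a contrastive term that pulls together embeddings of the attribute a pair shares and pushes apart embeddings of the attribute it differs in. The names (`encoder`, `decoder.reconstruction_loss`), the loss weighting, and the margin are illustrative assumptions, not the paper's actual formulation.

```python
import torch.nn.functional as F

def dual_objective_step(encoder, decoder, img_a, img_b,
                        shared_attr, differing_attr,
                        lambda_contrast=0.5, margin=0.2):
    """Sketch of one training step: generative fidelity + contrastive disentanglement.

    Assumptions (not taken from the paper): `encoder(image, attribute_text)` returns
    an embedding of shape (batch, d); `decoder.reconstruction_loss` stands in for
    whatever generative objective the frozen image generator actually uses.
    """
    # Attribute-conditioned embeddings for both images of a curated pair.
    za_shared = encoder(img_a, shared_attr)      # attribute the pair shares
    zb_shared = encoder(img_b, shared_attr)
    za_diff = encoder(img_a, differing_attr)     # attribute the pair differs in
    zb_diff = encoder(img_b, differing_attr)

    # (i) Generative fidelity: reconstruct the image from its own attribute embeddings.
    loss_fidelity = decoder.reconstruction_loss(img_a, condition=(za_shared, za_diff))

    # (ii) Contrastive disentanglement: pull shared-attribute embeddings together,
    # push differing-attribute embeddings apart (hinge with a small margin).
    sim = lambda x, y: F.cosine_similarity(x, y, dim=-1)
    loss_pull = (1.0 - sim(za_shared, zb_shared)).mean()
    loss_push = F.relu(sim(za_diff, zb_diff) - margin).mean()

    return loss_fidelity + lambda_contrast * (loss_pull + loss_push)
```

Whether the paper uses a cosine-margin loss, an InfoNCE variant, or batch-level negatives is not stated in the material above; the sketch only fixes the shape of the two-term objective.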

Core claim

We introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model by curating semantically linked image pairs annotated with positive and negative attributes and by adopting a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

What carries the argument

Semantically linked positive-negative image pairs trained with a dual objective that rewards both accurate image reconstruction and contrastive separation of the target attribute.

Load-bearing premise

Curating image pairs that differ in only one annotated attribute and training with dual fidelity and contrastive objectives will isolate that attribute without leakage into other factors.
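Read literally, the premise implies a training record like the following minimal sketch. The field names, and the convention that "positive" attributes are the shared ones and the single "negative" attribute is the differing one, are assumptions for illustration, not the paper's annotation schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CuratedPair:
    """Hypothetical record for one semantically linked training pair.

    The premise assumes each pair is annotated with the attributes the two images
    share (to preserve) and a single attribute along which they differ (to isolate
    or suppress). Field names are illustrative, not taken from the paper.
    """
    image_a: str                    # path or id of the first image
    image_b: str                    # path or id of the linked image
    positive_attributes: List[str]  # characteristics the pair shares, e.g. ["identity", "pose"]
    negative_attribute: str         # the one annotated attribute along which they differ

    def is_single_difference(self) -> bool:
        # The load-bearing premise: exactly one annotated attribute separates the pair.
        return self.negative_attribute not in self.positive_attributes

# Example record under these assumptions:
pair = CuratedPair(
    image_a="portrait_001.jpg",
    image_b="portrait_001_relit.jpg",
    positive_attributes=["identity", "expression", "background"],
    negative_attribute="lighting",
)
```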

What would settle it

If retrieval experiments show that an attribute embedding still correlates with unrelated factors such as lighting when only identity was labeled, or if personalization outputs alter non-target regions, the isolation claim would be falsified.
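One way to run the first of these checks is a simple linear probe: if embeddings queried only for identity still predict lighting well above chance, the attribute has leaked. The sketch below assumes precomputed embeddings and labels and uses scikit-learn; it illustrates the test, it is not an evaluation the paper reports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def leakage_probe(identity_embeddings: np.ndarray, lighting_labels: np.ndarray) -> float:
    """Sketch of the falsification test described above.

    Inputs are assumed to be an (N, d) array of identity-attribute embeddings and an
    (N,) array of lighting labels that were never shown to the encoder. If a simple
    linear probe recovers the unrelated factor well above chance, isolation fails.
    """
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, identity_embeddings, lighting_labels, cv=5)
    return float(scores.mean())  # near-chance accuracy supports isolation; high accuracy suggests leakage
```

A fair reading would compare the probe's accuracy against both the label-frequency baseline and the same probe run on a general-purpose embedding such as CLIP, which is expected to leak heavily.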

Figures

Figures reproduced from arXiv: 2512.10955 by Aliaksandr Siarohin, Anil Kag, Egor Nemchinov, Gordon Guocheng Qian, Ivan Skorokhodov, Jun-Yan Zhu, Kuan-Chieh Jackson Wang, Moayed Haji-Ali, Riza Alp Guler, Sergey Tulyakov, Tsai-Shien Chen, Willi Menapace.

Figure 1
Figure 1: Omni-Attribute is an open-vocabulary image attribute encoder that learns to extract attribute-specific representations from visual inputs. Given reference images (top row) paired with textual attribute descriptions (colored text boxes), Omni-Attribute encodes attribute representations that can be coherently synthesized in new contexts (middle and bottom rows) in a fully feed-forward manner, without any tes… view at source ↗
Figure 2
Figure 2: Training data annotation. Our training data consist of semantically linked image pairs annotated with positive and negative attributes that define their relationships through the shared and differing characteristics. The word cloud on the right highlights the richness and diversity of our attribute annotations, facilitating the training of an open-vocabulary attribute encoder. view at source ↗
Figure 4
Figure 4: Model architecture. Our attribute encoder is a LoRA-tuned MLLM followed by a trainable lightweight connector to preserve a strong vision-language prior while remaining able to adapt to our attribute disentanglement task. The image decoder is a frozen generator with trainable IP-Adapter [72] modules for personalization. view at source ↗
Figure 5
Figure 5: Qualitative comparisons of open-vocabulary attribute personalization. Each row, from top to bottom, shows (i) the reference image-attribute pair and the prompt, (ii) results generated using CLIP [54], DINOv2 [46], and Qwen-VL [66] embeddings, (iii) results from editing models, including OmniGen2 [70], FLUX-Kontext [35], and Qwen-Image-Edit [68], and (iv) results by Omni-Attribute. As shown, Omni-Attribute … view at source ↗
Figure 6
Figure 6: Quantitative comparisons of open-vocabulary attribute personalization. We compare Omni-Attribute with baseline methods on the personalization of two types of attributes: (a) concrete objects and (b) abstract concepts. We perform the evaluation across two metrics, image naturalness (higher is better) and conditioning fidelity (higher is better), using both MLLM [45] and human evaluations. Omni-Attribute con… view at source ↗
Figure 7
Figure 7: Composability of attribute embeddings. From top to bottom, each row shows the input conditions, the effect of a single image-attribute pair, and the compositional results of multiple attributes, showing the composability of our attribute embeddings. The prompt is "A vase is standing against a plain background." view at source ↗
Figure 8
Figure 8: T-SNE visualizations of attribute embedding spaces. We visualize the embedding spaces of the same 60 animal images across three different attributes and show that this same set of images is distributed differently and meaningfully across varying attributes. view at source ↗
Figure 9
Figure 9: Qualitative results of attribute-oriented image retrieval on CelebA [40]. Our embeddings enable image retrieval based on a specified attribute. Omni-Attribute surpasses the performance of text-guided retrieval by GPT-4o [45] and CLIP [54]. view at source ↗
Figure 10
Figure 10. view at source ↗
Figure 11
Figure 11: Instruction prompt for the first stage of attribute annotation. view at source ↗
Figure 12
Figure 12: Instruction prompt for MLLM evaluation. view at source ↗
Figure 13
Figure 13: Interface of the user study. Given the input conditions (top and right) and the generated image (center), participants are asked to rate three aspects: image naturalness, text fidelity, and attribute fidelity on a 1 (poor) to 5 (excellent) scale using the sliders (left). view at source ↗
Figure 14
Figure 14: Additional results of attribute disentanglement. Each row shows three generated images (right), which are conditioned on the same reference image (left) and the same textual prompt, but with different attribute inputs (colored boxes). As seen, given the same reference image, Omni-Attribute effectively extracts attribute-specific representations, enabling the coherent synthesis of the user-specified attrib… view at source ↗
Figure 15
Figure 15: Practical and creative applications of Omni-Attribute. From top to bottom, each row demonstrates the practical utility of Omni-Attribute across four real-world applications: (i) advertisement image synthesis, (ii) hairstyle customization, (iii) storytelling visualization, and (iv) creative content generation. view at source ↗
read the original abstract

Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Omni-Attribute, the first open-vocabulary image attribute encoder for visual concept personalization. It jointly designs data curation of semantically linked image pairs annotated with positive/negative attributes and a dual-objective training paradigm (generative fidelity plus contrastive disentanglement) to produce high-fidelity, attribute-specific embeddings. These embeddings are claimed to enable effective open-vocabulary attribute retrieval, personalization, and compositional generation while achieving SOTA performance across multiple benchmarks.

Significance. If the central claims hold, the work would meaningfully advance visual concept personalization by addressing entanglement in holistic embeddings, enabling more precise attribute transfer without leakage. The joint data-model design and open-vocabulary capability are strengths; reproducible code or machine-checked elements are not mentioned but would further strengthen impact if present.

major comments (2)
  1. [§3] §3 (Method, dual-objective training): The central claim that contrastive disentanglement on positive/negative pairs isolates single attributes without residual entanglement from correlated factors (e.g., lighting and expression) is load-bearing but unsupported by explicit leakage metrics or independence regularizers. The curation assumption that pairs differ only along the annotated attribute requires quantitative validation in experiments, as statistical correlations in visual data could undermine the attribute-specific representations.
  2. [§4] §4 (Experiments): The abstract asserts SOTA performance across benchmarks, yet no specific metrics, baselines, ablations, or error analysis are referenced in the provided text. Tables reporting quantitative results (e.g., retrieval accuracy, personalization FID) with comparisons are needed to substantiate the effectiveness claim; without them the evaluation is incomplete.
minor comments (2)
  1. [Abstract] Abstract: The description of 'semantically linked image pairs' could be clarified with an example or formal definition to make the curation process more transparent.
  2. [§3] Notation: Ensure consistent use of terms like 'generative fidelity' and 'contrastive disentanglement' when first introduced, with explicit loss formulations if equations are present.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of our claims where needed.

read point-by-point responses
  1. Referee: [§3] §3 (Method, dual-objective training): The central claim that contrastive disentanglement on positive/negative pairs isolates single attributes without residual entanglement from correlated factors (e.g., lighting and expression) is load-bearing but unsupported by explicit leakage metrics or independence regularizers. The curation assumption that pairs differ only along the annotated attribute requires quantitative validation in experiments, as statistical correlations in visual data could undermine the attribute-specific representations.

    Authors: We agree that explicit quantitative support for the disentanglement claim is important. Our dual-objective training combines generative fidelity with contrastive losses on the curated pairs to encourage isolation, but we acknowledge the need for direct metrics. In the revision we will add leakage analysis (e.g., pairwise attribute correlation in the learned embeddings) and a quantitative check on the curation assumption via statistical tests on the training pairs and human verification of attribute isolation. We will also report an ablation with an added independence regularizer to quantify its effect; one candidate form of such a regularizer is sketched after these responses. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts SOTA performance across benchmarks, yet no specific metrics, baselines, ablations, or error analysis are referenced in the provided text. Tables reporting quantitative results (e.g., retrieval accuracy, personalization FID) with comparisons are needed to substantiate the effectiveness claim; without them the evaluation is incomplete.

    Authors: We apologize for the lack of explicit references in the reviewed version. The full manuscript contains Section 4 with Tables 1–3 reporting retrieval accuracy, personalization FID, and compositional generation metrics, together with comparisons to CLIP, DINO, and prior personalization baselines, plus ablations on the dual objectives. We will revise the abstract and method section to directly cite these tables and add a short error analysis subsection. revision: partial
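For the independence-regularizer ablation promised in the first response, one candidate form (illustrative only, not the authors' stated choice) is a cross-covariance penalty between target-attribute and nuisance-attribute embeddings computed on the same batch:

```python
import torch

def cross_covariance_penalty(z_target: torch.Tensor, z_nuisance: torch.Tensor) -> torch.Tensor:
    """One possible independence regularizer (a sketch, not the paper's method).

    Penalizes the squared Frobenius norm of the cross-covariance between embeddings
    of the target attribute and embeddings of a nuisance attribute for the same batch
    of images; driving it to zero discourages linear leakage between the two spaces.
    Shapes assumed: (batch, d_t) and (batch, d_n).
    """
    zt = z_target - z_target.mean(dim=0, keepdim=True)
    zn = z_nuisance - z_nuisance.mean(dim=0, keepdim=True)
    cov = zt.T @ zn / (zt.shape[0] - 1)   # (d_t, d_n) cross-covariance matrix
    return (cov ** 2).sum()               # squared Frobenius norm
```

A penalty of this form only suppresses linear dependence; a kernel-based measure such as HSIC would be needed to detect nonlinear leakage.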

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper presents a methodological contribution consisting of the curation of semantically linked image pairs and a dual-objective training paradigm (generative fidelity plus contrastive disentanglement). No equations, derivations, or parameter-fitting steps are described in the provided text that reduce by construction to the inputs or to self-citations. The central claims rest on empirical training outcomes and external benchmarks rather than any self-referential definition, fitted-input prediction, or load-bearing self-citation chain. The approach builds on standard contrastive objectives and data-curation practices without relying on self-referential constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method description relies on standard contrastive and generative training without introducing new postulated objects.

pith-pipeline@v0.9.0 · 5512 in / 1043 out tokens · 42384 ms · 2026-05-16T22:50:11.318435+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 12 internal anchors

  1. [1] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. In SIGGRAPH Asia, 2023.
  2. [2] Sourav Banerjee. Animal image dataset. https://www.kaggle.com/datasets/iamsouravbanerjee/animal-image-dataset-90-different-animals, 2024.
  3. [3] Edurne Bernal-Berdun, Ana Serrano, Belen Masia, Matheus Gadelha, Yannick Hold-Geoffroy, Xin Sun, and Diego Gutierrez. PreciseCam: Precise camera control for text-to-image generation. In CVPR, 2025.
  4. [4] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In CVPR, 2023.
  5. [5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 2020.
  6. [6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
  7. [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
  8. [8] Tsai-Shien Chen, Wei-Chih Hung, Hung-Yu Tseng, Shao-Yi Chien, and Ming-Hsuan Yang. Incremental false negative detection for contrastive learning. In ICLR, 2022.
  9. [9] Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023.
  10. [10] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70M: Captioning 70M videos with multiple cross-modality teachers. In CVPR, 2024.
  11. [11] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. In CVPR, 2025.
  12. [12] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In ECCVW, 2004.
  13. [13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  14. [14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  15. [15] Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, and Sergey Tulyakov. VIMI: Grounding video generation through multi-modal instruction. In EMNLP, 2024.
  16. [16] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.
  17. [17] Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, and Tali Dekel. TokenVerse: Versatile multi-concept personalization in token modulation space. SIGGRAPH, 2025.
  18. [18] Google. Nano Banana. https://aistudio.google.com/models/gemini-2-5-flash-image, 2025.
  19. [19] Anujraaj Argo Goyal, Guocheng Gordon Qian, Huseyin Coskun, Aarush Gupta, Himmy Tam, Daniil Ostashev, Ju Hu, Dhritiman Sagar, Sergey Tulyakov, Kfir Aberman, and Kuan-Chieh Jackson Wang. Preventing shortcuts in adapter training via providing the shortcuts. In NeurIPS, 2025.
  20. [20] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS, 2020.
  21. [21] Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. LivePortrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024.
  22. [22] Shaozhe Hao, Kai Han, Zhengyao Lv, Shihao Zhao, and Kwan-Yee K. Wong. ConceptExpress: Harnessing diffusion models for single-image unsupervised concept extraction. In ECCV, 2024.
  23. [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  24. [24] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  25. [25] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  26. [26] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  27. [27] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshop, 2021.
  28. [28] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  29. [29] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  30. [30] InstantX. InstantX FLUX.1-dev IP-Adapter page. https://huggingface.co/InstantX/FLUX.1-dev-IP-Adapter, 2024.
  31. [31] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  32. [32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. NeurIPS, 2012.
  33. [33] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In CVPR, 2023.
  34. [34] Black Forest Labs. FLUX.1-dev. https://huggingface.co/black-forest-labs/FLUX.1-dev, 2024.
  35. [35] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025.
  36. [36] Jia Li, Jinming Su, Changqun Xia, and Yonghong Tian. Distortion-adaptive salient object detection in 360 omnidirectional images. IEEE Journal of Selected Topics in Signal Processing, 2019.
  37. [37] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
  38. [38] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In ECCV, 2022.
  39. [39] Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025.
  40. [40] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  41. [41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  42. [42] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
  43. [43] Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap Video: Scaled spatiotemporal transformers for text-to-video synthesis. In CVPR, 2024.
  44. [44] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
  45. [45] OpenAI. GPT-4o. https://openai.com/index/hello-gpt-4o/, 2025.
  46. [46] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  47. [47] Semih Orhan and Yalin Bastanlar. Semantic segmentation of outdoor panoramic images. Signal, Image and Video Processing, 2021.
  48. [48] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
  49. [49] Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. DreamBench++: A human-aligned benchmark for personalized image generation. arXiv preprint arXiv:2406.16855, 2024.
  50. [50] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  51. [51] Guocheng Qian, Kuan-Chieh Wang, Or Patashnik, Negin Heravi, Daniil Ostashev, Sergey Tulyakov, Daniel Cohen-Or, and Kfir Aberman. Omni-ID: Holistic identity representation designed for generative tasks. In CVPR, pages 8786–8795, 2025.
  52. [52] Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, and Kfir Aberman. ComposeMe: Attribute-specific image prompts for controllable human image generation. arXiv preprint arXiv:2509.18092, 2025.
  53. [53] Guocheng Gordon Qian, Ruihang Zhang, Tsai-Shien Chen, Yusuf Dalva, Anujraaj Argo Goyal, Willi Menapace, Ivan Skorokhodov, Meng Dong, Arpit Sahni, Daniil Ostashev, et al. LayerComposer: Interactive personalized T2I via spatially-aware layered canvas. arXiv preprint arXiv:2510.20820, 2025.
  54. [54] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  55. [55] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
  56. [56] Nirat Saini, Khoi Pham, and Abhinav Shrivastava. Disentangling visual embeddings for attributes and objects. In CVPR, 2022.
  57. [57] Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. InstantBooth: Personalized text-to-image generation without test-time finetuning. In CVPR, 2024.
  58. [58] Reece Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs full fine-tuning: An illusion of equivalence. arXiv preprint arXiv:2410.21228, 2024.
  59. [59] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
  60. [60] Vincent Sitzmann, Ana Serrano, Amy Pavel, Maneesh Agrawala, Diego Gutierrez, Belen Masia, and Gordon Wetzstein. Saliency in VR: How do people explore virtual environments? IEEE Transactions on Visualization and Computer Graphics, 2018.
  61. [61] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
  62. [62] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  63. [63] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023.
  64. [64] Ashish Vaswani et al. Attention is all you need. In NeurIPS, 2017.
  65. [65] Yael Vinker, Andrey Voynov, Daniel Cohen-Or, and Ariel Shamir. Concept decomposition for visual exploration and inspiration. ACM TOG, 2023.
  66. [66] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
  67. [67] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
  68. [68] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.
  69. [69] Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, et al. DreamOmni2: Multimodal instruction-based editing and generation. arXiv preprint arXiv:2510.06679, 2025.
  70. [70] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In CVPR, 2025.
  71. [71] Zhichao Yang, Leida Li, Pengfei Chen, Jinjian Wu, and Giuseppe Valenzise. Language-guided visual perception disentanglement for image quality assessment and conditional image generation. arXiv preprint arXiv:2503.02206, 2025.
  72. [72] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
  73. [73] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  74. [74] Yi Zhang, Lu Zhang, Wassim Hamidouche, and Olivier Deforges. A fixation-based 360 benchmark dataset for salient object detection. In ICIP, 2020.
  75. [75] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
  76. [76] Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, and Guanbin Li. Mod-Adapter: Tuning-free and versatile multi-concept personalization via modulation adapter. arXiv preprint arXiv:2505.18612, 2025.