pith. machine review for the scientific record.

arxiv: 2512.10955 · v2 · submitted 2025-12-11 · 💻 cs.CV

Recognition: 2 Lean theorem links

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords: open-vocabulary attribute encoder · visual concept personalization · attribute disentanglement · image synthesis · contrastive learning · generative models · attribute retrieval

The pith

Omni-Attribute is the first open-vocabulary encoder that learns isolated representations for single visual attributes like identity or lighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to replace general-purpose image embeddings, which mix many visual factors, with a dedicated encoder that captures only one chosen attribute at a time. It does so by building training data from image pairs that differ in exactly one labeled attribute and by training the model with two goals at once: accurate reconstruction of the changed image and contrastive separation from the unchanged one. Readers would care because this separation would let image-editing systems transfer a single trait such as expression or style into new scenes while leaving all other aspects untouched. The resulting embeddings are shown to work for retrieval, personalization, and combining several attributes in one output.
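To make the dual objective concrete, here is a minimal sketch of what one training step could look like under the reading above: a generative fidelity term computed by a frozen decoder plus a contrastive term that pulls together embeddings of the attribute a pair shares and pushes apart embeddings of the attribute it differs in. The names (`encoder`, `decoder.reconstruction_loss`), the loss weighting, and the margin are illustrative assumptions, not the paper's actual formulation.

```python
import torch.nn.functional as F

def dual_objective_step(encoder, decoder, img_a, img_b,
                        shared_attr, differing_attr,
                        lambda_contrast=0.5, margin=0.2):
    """Sketch of one training step: generative fidelity + contrastive disentanglement.

    Assumptions (not taken from the paper): `encoder(image, attribute_text)` returns
    an embedding of shape (batch, d); `decoder.reconstruction_loss` stands in for
    whatever generative objective the frozen image generator actually uses.
    """
    # Attribute-conditioned embeddings for both images of a curated pair.
    za_shared = encoder(img_a, shared_attr)      # attribute the pair shares
    zb_shared = encoder(img_b, shared_attr)
    za_diff = encoder(img_a, differing_attr)     # attribute the pair differs in
    zb_diff = encoder(img_b, differing_attr)

    # (i) Generative fidelity: reconstruct the image from its own attribute embeddings.
    loss_fidelity = decoder.reconstruction_loss(img_a, condition=(za_shared, za_diff))

    # (ii) Contrastive disentanglement: pull shared-attribute embeddings together,
    # push differing-attribute embeddings apart (hinge with a small margin).
    sim = lambda x, y: F.cosine_similarity(x, y, dim=-1)
    loss_pull = (1.0 - sim(za_shared, zb_shared)).mean()
    loss_push = F.relu(sim(za_diff, zb_diff) - margin).mean()

    return loss_fidelity + lambda_contrast * (loss_pull + loss_push)
```

Whether the paper uses a cosine-margin loss, an InfoNCE variant, or batch-level negatives is not stated in the material above; the sketch only fixes the shape of the two-term objective.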

Core claim

We introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model by curating semantically linked image pairs annotated with positive and negative attributes and by adopting a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

What carries the argument

Semantically linked positive-negative image pairs trained with a dual objective that rewards both accurate image reconstruction and contrastive separation of the target attribute.

Load-bearing premise

Curating image pairs that differ in only one annotated attribute and training with dual fidelity and contrastive objectives will isolate that attribute without leakage into other factors.
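Read literally, the premise implies a training record like the following minimal sketch. The field names, and the convention that "positive" attributes are the shared ones and the single "negative" attribute is the differing one, are assumptions for illustration, not the paper's annotation schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CuratedPair:
    """Hypothetical record for one semantically linked training pair.

    The premise assumes each pair is annotated with the attributes the two images
    share (to preserve) and a single attribute along which they differ (to isolate
    or suppress). Field names are illustrative, not taken from the paper.
    """
    image_a: str                    # path or id of the first image
    image_b: str                    # path or id of the linked image
    positive_attributes: List[str]  # characteristics the pair shares, e.g. ["identity", "pose"]
    negative_attribute: str         # the one annotated attribute along which they differ

    def is_single_difference(self) -> bool:
        # The load-bearing premise: exactly one annotated attribute separates the pair.
        return self.negative_attribute not in self.positive_attributes

# Example record under these assumptions:
pair = CuratedPair(
    image_a="portrait_001.jpg",
    image_b="portrait_001_relit.jpg",
    positive_attributes=["identity", "expression", "background"],
    negative_attribute="lighting",
)
```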

What would settle it

If retrieval experiments show that an attribute embedding still correlates with unrelated factors such as lighting when only identity was labeled, or if personalization outputs alter non-target regions, the isolation claim would be falsified.
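One way to run the first of these checks is a simple linear probe: if embeddings queried only for identity still predict lighting well above chance, the attribute has leaked. The sketch below assumes precomputed embeddings and labels and uses scikit-learn; it illustrates the test, it is not an evaluation the paper reports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def leakage_probe(identity_embeddings: np.ndarray, lighting_labels: np.ndarray) -> float:
    """Sketch of the falsification test described above.

    Inputs are assumed to be an (N, d) array of identity-attribute embeddings and an
    (N,) array of lighting labels that were never shown to the encoder. If a simple
    linear probe recovers the unrelated factor well above chance, isolation fails.
    """
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, identity_embeddings, lighting_labels, cv=5)
    return float(scores.mean())  # near-chance accuracy supports isolation; high accuracy suggests leakage
```

A fair reading would compare the probe's accuracy against both the label-frequency baseline and the same probe run on a general-purpose embedding such as CLIP, which is expected to leak heavily.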

Figures

Figures reproduced from arXiv: 2512.10955 by Aliaksandr Siarohin, Anil Kag, Egor Nemchinov, Gordon Guocheng Qian, Ivan Skorokhodov, Jun-Yan Zhu, Kuan-Chieh Jackson Wang, Moayed Haji-Ali, Riza Alp Guler, Sergey Tulyakov, Tsai-Shien Chen, Willi Menapace.

Figure 1
Figure 1: Omni-Attribute is an open-vocabulary image attribute encoder that learns to extract attribute-specific representations from visual inputs. Given reference images (top row) paired with textual attribute descriptions (colored text boxes), Omni-Attribute encodes attribute representations that can be coherently synthesized in new contexts (middle and bottom rows) in a fully feed-forward manner, without any tes… view at source ↗
Figure 2
Figure 2: Training data annotation. Our training data consist of semantically linked image pairs annotated with positive and negative attributes that define their relationships through the shared and differing characteristics. The word cloud on the right highlights the richness and diversity of our attribute annotations, facilitating the training of an open-vocabulary attribute encoder. view at source ↗
Figure 4
Figure 4: Model architecture. Our attribute encoder is a LoRA-tuned MLLM followed by a trainable lightweight connector to preserve a strong vision-language prior while remaining able to adapt to our attribute disentanglement task. The image decoder is a frozen generator with trainable IP-Adapter [72] modules for personalization. view at source ↗
Figure 5
Figure 5: Qualitative comparisons of open-vocabulary attribute personalization. Each row, from top to bottom, shows (i) the reference image-attribute pair and the prompt, (ii) results generated using CLIP [54], DINOv2 [46], and Qwen-VL [66] embeddings, (iii) results from editing models, including OmniGen2 [70], FLUX-Kontext [35], and Qwen-Image-Edit [68], and (iv) results by Omni-Attribute. As shown, Omni-Attribute … view at source ↗
Figure 6
Figure 6: Quantitative comparisons of open-vocabulary attribute personalization. We compare Omni-Attribute with baseline methods on the personalization of two types of attributes: (a) concrete objects and (b) abstract concepts. We perform the evaluation across two metrics, image naturalness (higher is better) and conditioning fidelity (higher is better), using both MLLM [45] and human evaluations. Omni-Attribute con… view at source ↗
Figure 7
Figure 7: Composability of attribute embeddings. From top to bottom, each row shows the input conditions, the effect of a single image-attribute pair, and the compositional results of multiple attributes, showing the composability of our attribute embeddings. The prompt is "A vase is standing against a plain background." view at source ↗
Figure 8
Figure 8: T-SNE visualizations of attribute embedding spaces. We visualize the embedding spaces of the same 60 animal images across three different attributes and show that this same set of images is distributed differently and meaningfully across varying attributes. view at source ↗
Figure 9
Figure 9: Qualitative results of attribute-oriented image retrieval on CelebA [40]. Our embeddings enable image retrieval based on a specified attribute. Omni-Attribute surpasses the performance of text-guided retrieval by GPT-4o [45] and CLIP [54]. view at source ↗
Figure 10
Figure 10. view at source ↗
Figure 11
Figure 11: Instruction prompt for the first stage of attribute annotation. view at source ↗
Figure 12
Figure 12: Instruction prompt for MLLM evaluation. view at source ↗
Figure 13
Figure 13: Interface of the user study. Given the input conditions (top and right) and the generated image (center), participants are asked to rate three aspects: image naturalness, text fidelity, and attribute fidelity on a 1 (poor) to 5 (excellent) scale using the sliders (left). view at source ↗
Figure 14
Figure 14: Additional results of attribute disentanglement. Each row shows three generated images (right), which are conditioned on the same reference image (left) and the same textual prompt, but with different attribute inputs (colored boxes). As seen, given the same reference image, Omni-Attribute effectively extracts attribute-specific representations, enabling the coherent synthesis of the user-specified attrib… view at source ↗
Figure 15
Figure 15: Practical and creative applications of Omni-Attribute. From top to bottom, each row demonstrates the practical utility of Omni-Attribute across four real-world applications: (i) advertisement image synthesis, (ii) hairstyle customization, (iii) storytelling visualization, and (iv) creative content generation. view at source ↗
read the original abstract

Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Omni-Attribute, the first open-vocabulary image attribute encoder for visual concept personalization. It jointly designs data curation of semantically linked image pairs annotated with positive/negative attributes and a dual-objective training paradigm (generative fidelity plus contrastive disentanglement) to produce high-fidelity, attribute-specific embeddings. These embeddings are claimed to enable effective open-vocabulary attribute retrieval, personalization, and compositional generation while achieving SOTA performance across multiple benchmarks.

Significance. If the central claims hold, the work would meaningfully advance visual concept personalization by addressing entanglement in holistic embeddings, enabling more precise attribute transfer without leakage. The joint data-model design and open-vocabulary capability are strengths; reproducible code or machine-checked elements are not mentioned but would further strengthen impact if present.

major comments (2)
  1. [§3] §3 (Method, dual-objective training): The central claim that contrastive disentanglement on positive/negative pairs isolates single attributes without residual entanglement from correlated factors (e.g., lighting and expression) is load-bearing but unsupported by explicit leakage metrics or independence regularizers. The curation assumption that pairs differ only along the annotated attribute requires quantitative validation in experiments, as statistical correlations in visual data could undermine the attribute-specific representations.
  2. [§4] §4 (Experiments): The abstract asserts SOTA performance across benchmarks, yet no specific metrics, baselines, ablations, or error analysis are referenced in the provided text. Tables reporting quantitative results (e.g., retrieval accuracy, personalization FID) with comparisons are needed to substantiate the effectiveness claim; without them the evaluation is incomplete.
minor comments (2)
  1. [Abstract] Abstract: The description of 'semantically linked image pairs' could be clarified with an example or formal definition to make the curation process more transparent.
  2. [§3] Notation: Ensure consistent use of terms like 'generative fidelity' and 'contrastive disentanglement' when first introduced, with explicit loss formulations if equations are present.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of our claims where needed.

read point-by-point responses
  1. Referee: [§3] §3 (Method, dual-objective training): The central claim that contrastive disentanglement on positive/negative pairs isolates single attributes without residual entanglement from correlated factors (e.g., lighting and expression) is load-bearing but unsupported by explicit leakage metrics or independence regularizers. The curation assumption that pairs differ only along the annotated attribute requires quantitative validation in experiments, as statistical correlations in visual data could undermine the attribute-specific representations.

    Authors: We agree that explicit quantitative support for the disentanglement claim is important. Our dual-objective training combines generative fidelity with contrastive losses on the curated pairs to encourage isolation, but we acknowledge the need for direct metrics. In the revision we will add leakage analysis (e.g., pairwise attribute correlation in the learned embeddings) and a quantitative check on the curation assumption via statistical tests on the training pairs and human verification of attribute isolation. We will also report an ablation with an added independence regularizer to quantify its effect; one candidate form of such a regularizer is sketched after these responses. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts SOTA performance across benchmarks, yet no specific metrics, baselines, ablations, or error analysis are referenced in the provided text. Tables reporting quantitative results (e.g., retrieval accuracy, personalization FID) with comparisons are needed to substantiate the effectiveness claim; without them the evaluation is incomplete.

    Authors: We apologize for the lack of explicit references in the reviewed version. The full manuscript contains Section 4 with Tables 1–3 reporting retrieval accuracy, personalization FID, and compositional generation metrics, together with comparisons to CLIP, DINO, and prior personalization baselines, plus ablations on the dual objectives. We will revise the abstract and method section to directly cite these tables and add a short error analysis subsection. revision: partial
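For the independence-regularizer ablation promised in the first response, one candidate form (illustrative only, not the authors' stated choice) is a cross-covariance penalty between target-attribute and nuisance-attribute embeddings computed on the same batch:

```python
import torch

def cross_covariance_penalty(z_target: torch.Tensor, z_nuisance: torch.Tensor) -> torch.Tensor:
    """One possible independence regularizer (a sketch, not the paper's method).

    Penalizes the squared Frobenius norm of the cross-covariance between embeddings
    of the target attribute and embeddings of a nuisance attribute for the same batch
    of images; driving it to zero discourages linear leakage between the two spaces.
    Shapes assumed: (batch, d_t) and (batch, d_n).
    """
    zt = z_target - z_target.mean(dim=0, keepdim=True)
    zn = z_nuisance - z_nuisance.mean(dim=0, keepdim=True)
    cov = zt.T @ zn / (zt.shape[0] - 1)   # (d_t, d_n) cross-covariance matrix
    return (cov ** 2).sum()               # squared Frobenius norm
```

A penalty of this form only suppresses linear dependence; a kernel-based measure such as HSIC would be needed to detect nonlinear leakage.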

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper presents a methodological contribution consisting of the curation of semantically linked image pairs and a dual-objective training paradigm (generative fidelity plus contrastive disentanglement). No equations, derivations, or parameter-fitting steps are described in the provided text that reduce by construction to the inputs or to self-citations. The central claims rest on empirical training outcomes and external benchmarks rather than any self-referential definition, fitted-input prediction, or load-bearing self-citation chain. The approach builds on standard contrastive objectives and data-curation practices without relying on self-referential constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method description relies on standard contrastive and generative training without introducing new postulated objects.

pith-pipeline@v0.9.0 · 5512 in / 1043 out tokens · 42384 ms · 2026-05-16T22:50:11.318435+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 12 internal anchors

  1. [1] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. In SIGGRAPH Asia, 2023.
  2. [2] Sourav Banerjee. Animal image dataset. https://www.kaggle.com/datasets/iamsouravbanerjee/animal-image-dataset-90-different-animals, 2024.
  3. [3] Edurne Bernal-Berdun, Ana Serrano, Belen Masia, Matheus Gadelha, Yannick Hold-Geoffroy, Xin Sun, and Diego Gutierrez. PreciseCam: Precise camera control for text-to-image generation. In CVPR, 2025.
  4. [4] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In CVPR, 2023.
  5. [5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 2020.
  6. [6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
  7. [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
  8. [8] Tsai-Shien Chen, Wei-Chih Hung, Hung-Yu Tseng, Shao-Yi Chien, and Ming-Hsuan Yang. Incremental false negative detection for contrastive learning. In ICLR, 2022.
  9. [9] Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023.
  10. [10] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70M: Captioning 70M videos with multiple cross-modality teachers. In CVPR, 2024.
  11. [11] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. In CVPR, 2025.
  12. [12] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In ECCVW, 2004.
  13. [13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  14. [14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  15. [15] Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, and Sergey Tulyakov. VIMI: Grounding video generation through multi-modal instruction. In EMNLP, 2024.
  16. [16] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.
  17. [17] Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, and Tali Dekel. TokenVerse: Versatile multi-concept personalization in token modulation space. SIGGRAPH, 2025.
  18. [18] Google. Nano Banana. https://aistudio.google.com/models/gemini-2-5-flash-image, 2025.
  19. [19] Anujraaj Argo Goyal, Guocheng Gordon Qian, Huseyin Coskun, Aarush Gupta, Himmy Tam, Daniil Ostashev, Ju Hu, Dhritiman Sagar, Sergey Tulyakov, Kfir Aberman, and Kuan-Chieh Jackson Wang. Preventing shortcuts in adapter training via providing the shortcuts. In NeurIPS, 2025.
  20. [20] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS, 2020.
  21. [21] Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. LivePortrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024.
  22. [22] Shaozhe Hao, Kai Han, Zhengyao Lv, Shihao Zhao, and Kwan-Yee K. Wong. ConceptExpress: Harnessing diffusion models for single-image unsupervised concept extraction. In ECCV, 2024.
  23. [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  24. [24] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  25. [25] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  26. [26] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  27. [27] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshop, 2021.
  28. [28] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  29. [29] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  30. [30] InstantX. InstantX FLUX.1-dev IP-Adapter page. https://huggingface.co/InstantX/FLUX.1-dev-IP-Adapter, 2024.
  31. [31] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  32. [32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. NeurIPS, 2012.
  33. [33] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In CVPR, 2023.
  34. [34] Black Forest Labs. FLUX.1-dev. https://huggingface.co/black-forest-labs/FLUX.1-dev, 2024.
  35. [35] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025.
  36. [36] Jia Li, Jinming Su, Changqun Xia, and Yonghong Tian. Distortion-adaptive salient object detection in 360 omnidirectional images. IEEE Journal of Selected Topics in Signal Processing, 2019.
  37. [37] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
  38. [38] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In ECCV, 2022.
  39. [39] Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025.
  40. [40] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  41. [41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  42. [42] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
  43. [43] Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap Video: Scaled spatiotemporal transformers for text-to-video synthesis. In CVPR, 2024.
  44. [44] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
  45. [45] OpenAI. GPT-4o. https://openai.com/index/hello-gpt-4o/, 2025.
  46. [46] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  47. [47] Semih Orhan and Yalin Bastanlar. Semantic segmentation of outdoor panoramic images. Signal, Image and Video Processing, 2021.
  48. [48] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
  49. [49] Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. DreamBench++: A human-aligned benchmark for personalized image generation. arXiv preprint arXiv:2406.16855, 2024.
  50. [50] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  51. [51] Guocheng Qian, Kuan-Chieh Wang, Or Patashnik, Negin Heravi, Daniil Ostashev, Sergey Tulyakov, Daniel Cohen-Or, and Kfir Aberman. Omni-ID: Holistic identity representation designed for generative tasks. In CVPR, pages 8786–8795, 2025.
  52. [52] Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, and Kfir Aberman. ComposeMe: Attribute-specific image prompts for controllable human image generation. arXiv preprint arXiv:2509.18092, 2025.
  53. [53] Guocheng Gordon Qian, Ruihang Zhang, Tsai-Shien Chen, Yusuf Dalva, Anujraaj Argo Goyal, Willi Menapace, Ivan Skorokhodov, Meng Dong, Arpit Sahni, Daniil Ostashev, et al. LayerComposer: Interactive personalized T2I via spatially-aware layered canvas. arXiv preprint arXiv:2510.20820, 2025.
  54. [54] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  55. [55] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
  56. [56] Nirat Saini, Khoi Pham, and Abhinav Shrivastava. Disentangling visual embeddings for attributes and objects. In CVPR, 2022.
  57. [57] Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. InstantBooth: Personalized text-to-image generation without test-time finetuning. In CVPR, 2024.
  58. [58] Reece Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs full fine-tuning: An illusion of equivalence. arXiv preprint arXiv:2410.21228, 2024.
  59. [59] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
  60. [60] Vincent Sitzmann, Ana Serrano, Amy Pavel, Maneesh Agrawala, Diego Gutierrez, Belen Masia, and Gordon Wetzstein. Saliency in VR: How do people explore virtual environments? IEEE Transactions on Visualization and Computer Graphics, 2018.
  61. [61] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
  62. [62] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  63. [63] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023.
  64. [64] Ashish Vaswani et al. Attention is all you need. In NeurIPS, 2017.
  65. [65] Yael Vinker, Andrey Voynov, Daniel Cohen-Or, and Ariel Shamir. Concept decomposition for visual exploration and inspiration. ACM TOG, 2023.
  66. [66] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
  67. [67] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
  68. [68] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.
  69. [69] Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, et al. DreamOmni2: Multimodal instruction-based editing and generation. arXiv preprint arXiv:2510.06679, 2025.
  70. [70] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In CVPR, 2025.
  71. [71] Zhichao Yang, Leida Li, Pengfei Chen, Jinjian Wu, and Giuseppe Valenzise. Language-guided visual perception disentanglement for image quality assessment and conditional image generation. arXiv preprint arXiv:2503.02206, 2025.
  72. [72] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
  73. [73] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  74. [74] Yi Zhang, Lu Zhang, Wassim Hamidouche, and Olivier Deforges. A fixation-based 360 benchmark dataset for salient object detection. In ICIP, 2020.
  75. [75] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
  76. [76] Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, and Guanbin Li. Mod-Adapter: Tuning-free and versatile multi-concept personalization via modulation adapter. arXiv preprint arXiv:2505.18612, 2025.