Recognition: 2 theorem links
· Lean TheoremOmni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Pith reviewed 2026-05-16 22:50 UTC · model grok-4.3
The pith
Omni-Attribute is the first open-vocabulary encoder that learns isolated representations for single visual attributes like identity or lighting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model by curating semantically linked image pairs annotated with positive and negative attributes and by adopting a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
What carries the argument
Semantically linked positive-negative image pairs trained with a dual objective that rewards both accurate image reconstruction and contrastive separation of the target attribute.
Load-bearing premise
Curating image pairs that differ in only one annotated attribute and training with dual fidelity and contrastive objectives will isolate that attribute without leakage into other factors.
What would settle it
If retrieval experiments show that an attribute embedding still correlates with unrelated factors such as lighting when only identity was labeled, or if personalization outputs alter non-target regions, the isolation claim would be falsified.
Figures
read the original abstract
Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Omni-Attribute, the first open-vocabulary image attribute encoder for visual concept personalization. It jointly designs data curation of semantically linked image pairs annotated with positive/negative attributes and a dual-objective training paradigm (generative fidelity plus contrastive disentanglement) to produce high-fidelity, attribute-specific embeddings. These embeddings are claimed to enable effective open-vocabulary attribute retrieval, personalization, and compositional generation while achieving SOTA performance across multiple benchmarks.
Significance. If the central claims hold, the work would meaningfully advance visual concept personalization by addressing entanglement in holistic embeddings, enabling more precise attribute transfer without leakage. The joint data-model design and open-vocabulary capability are strengths; reproducible code or machine-checked elements are not mentioned but would further strengthen impact if present.
major comments (2)
- [§3] §3 (Method, dual-objective training): The central claim that contrastive disentanglement on positive/negative pairs isolates single attributes without residual entanglement from correlated factors (e.g., lighting and expression) is load-bearing but unsupported by explicit leakage metrics or independence regularizers. The curation assumption that pairs differ only along the annotated attribute requires quantitative validation in experiments, as statistical correlations in visual data could undermine the attribute-specific representations.
- [§4] §4 (Experiments): The abstract asserts SOTA performance across benchmarks, yet no specific metrics, baselines, ablations, or error analysis are referenced in the provided text. Tables reporting quantitative results (e.g., retrieval accuracy, personalization FID) with comparisons are needed to substantiate the effectiveness claim; without them the evaluation is incomplete.
minor comments (2)
- [Abstract] Abstract: The description of 'semantically linked image pairs' could be clarified with an example or formal definition to make the curation process more transparent.
- [§3] Notation: Ensure consistent use of terms like 'generative fidelity' and 'contrastive disentanglement' when first introduced, with explicit loss formulations if equations are present.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of our claims where needed.
read point-by-point responses
-
Referee: [§3] §3 (Method, dual-objective training): The central claim that contrastive disentanglement on positive/negative pairs isolates single attributes without residual entanglement from correlated factors (e.g., lighting and expression) is load-bearing but unsupported by explicit leakage metrics or independence regularizers. The curation assumption that pairs differ only along the annotated attribute requires quantitative validation in experiments, as statistical correlations in visual data could undermine the attribute-specific representations.
Authors: We agree that explicit quantitative support for the disentanglement claim is important. Our dual-objective training combines generative fidelity with contrastive losses on the curated pairs to encourage isolation, but we acknowledge the need for direct metrics. In the revision we will add leakage analysis (e.g., pairwise attribute correlation in the learned embeddings) and a quantitative check on the curation assumption via statistical tests on the training pairs and human verification of attribute isolation. We will also report an ablation with an added independence regularizer to quantify its effect. revision: yes
-
Referee: [§4] §4 (Experiments): The abstract asserts SOTA performance across benchmarks, yet no specific metrics, baselines, ablations, or error analysis are referenced in the provided text. Tables reporting quantitative results (e.g., retrieval accuracy, personalization FID) with comparisons are needed to substantiate the effectiveness claim; without them the evaluation is incomplete.
Authors: We apologize for the lack of explicit references in the reviewed version. The full manuscript contains Section 4 with Tables 1–3 reporting retrieval accuracy, personalization FID, and compositional generation metrics, together with comparisons to CLIP, DINO, and prior personalization baselines, plus ablations on the dual objectives. We will revise the abstract and method section to directly cite these tables and add a short error analysis subsection. revision: partial
Circularity Check
No circularity detected in derivation chain
full rationale
The paper presents a methodological contribution consisting of data curation of semantically linked image pairs and a dual-objective training paradigm (generative fidelity plus contrastive disentanglement). No equations, derivations, or parameter-fitting steps are described in the provided text that reduce by construction to the inputs or to self-citations. The central claims rest on empirical training outcomes and external benchmarks rather than any self-referential definition, fitted-input prediction, or load-bearing self-citation chain. The approach is self-contained against standard contrastive objectives and data curation practices.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
dual-objective training paradigm that balances generative fidelity with contrastive disentanglement
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
semantically linked image pairs annotated with positive and negative attributes
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Break-a-scene: Extracting multiple concepts from a single image
Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen- Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. InSIGGRAPH Asia, 2023. 3
work page 2023
-
[2]
Animal image dataset.https://www
Sourav Banerjee. Animal image dataset.https://www. kaggle . com / datasets / iamsouravbanerjee / animal - image - dataset - 90 - different - animals, 2024. 7
work page 2024
-
[3]
Precisecam: Precise camera control for text-to- image generation
Edurne Bernal-Berdun, Ana Serrano, Belen Masia, Matheus Gadelha, Yannick Hold-Geoffroy, Xin Sun, and Diego Gutierrez. Precisecam: Precise camera control for text-to- image generation. InCVPR, 2025. 13
work page 2025
-
[4]
In- structpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structpix2pix: Learning to follow image editing instructions. InCVPR, 2023. 5 9
work page 2023
-
[5]
Unsupervised learn- ing of visual features by contrasting cluster assignments
Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Pi- otr Bojanowski, and Armand Joulin. Unsupervised learn- ing of visual features by contrasting cluster assignments. NeurIPS, 2020. 3
work page 2020
-
[6]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In ICCV, 2021. 3
work page 2021
-
[7]
A simple framework for contrastive learning of visual representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. InICML, 2020. 3, 9
work page 2020
-
[8]
Incremental false negative detection for contrastive learning
Tsai-Shien Chen, Wei-Chih Hung, Hung-Yu Tseng, Shao-Yi Chien, and Ming-Hsuan Yang. Incremental false negative detection for contrastive learning. InICLR, 2022. 3, 9
work page 2022
-
[9]
Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung- Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffu- sion model for controllable video synthesis.arXiv preprint arXiv:2304.14404, 2023. 3
-
[10]
Panda-70m: Captioning 70m videos with multiple cross-modality teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InCVPR, 2024
work page 2024
-
[11]
Multi-subject open-set personalization in video generation
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aber- man, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. InCVPR, 2025. 2, 3
work page 2025
-
[12]
Visual categorization with bags of keypoints
Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and C ´edric Bray. Visual categorization with bags of keypoints. InECCVW, 2004. 2
work page 2004
-
[13]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009. 3
work page 2009
-
[14]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021. 3
work page 2021
-
[15]
VIMI: Grounding video gen- eration through multi-modal instruction
Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai- Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, and Sergey Tulyakov. VIMI: Grounding video gen- eration through multi-modal instruction. InEMNLP, 2024. 3
work page 2024
-
[16]
An image is worth one word: Personalizing text-to-image generation using textual inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-or. An image is worth one word: Personalizing text-to-image generation using textual inversion. InICLR, 2023. 2, 3
work page 2023
-
[17]
Tokenverse: Versatile multi-concept personalization in token modulation space.SIGGRAPH, 2025
Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, and Tali Dekel. Tokenverse: Versatile multi-concept personalization in token modulation space.SIGGRAPH, 2025. 3
work page 2025
-
[18]
Nano banana.https://aistudio.google
Google. Nano banana.https://aistudio.google. com/models/gemini-2-5-flash-image, 2025. 3
work page 2025
-
[19]
Preventing shortcuts in adapter training via providing the shortcuts
Anujraaj Argo Goyal, Guocheng Gordon Qian, Huseyin Coskun, Aarush Gupta, Himmy Tam, Daniil Ostashev, Ju Hu, Dhritiman Sagar, Sergey Tulyakov, Kfir Aberman, and Kuan-Chieh Jackson Wang. Preventing shortcuts in adapter training via providing the shortcuts. InNeurIPS, 2025. 16
work page 2025
-
[20]
Bootstrap your own latent-a new approach to self-supervised learning.NeurIPS, 2020
Jean-Bastien Grill, Florian Strub, Florent Altch ´e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh- laghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.NeurIPS, 2020. 3
work page 2020
-
[21]
Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Livepor- trait: Efficient portrait animation with stitching and retarget- ing control.arXiv preprint arXiv:2407.03168, 2024. 13
-
[22]
Conceptexpress: Harnessing diffusion models for single-image unsupervised concept extraction
Shaozhe Hao, Kai Han, Zhengyao Lv, Shihao Zhao, and Kwan-Yee K Wong. Conceptexpress: Harnessing diffusion models for single-image unsupervised concept extraction. In ECCV, 2024. 3
work page 2024
-
[23]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR,
-
[24]
Momentum contrast for unsupervised visual rep- resentation learning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. InCVPR, 2020. 3
work page 2020
-
[25]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, 2022. 3
work page 2022
-
[26]
Prompt-to-Prompt Image Editing with Cross Attention Control
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im- age editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[27]
Classifier-free diffusion guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS Workshop, 2021. 5
work page 2021
-
[28]
Denoising diffu- sion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. InNeurIPS, 2020. 3
work page 2020
-
[29]
LoRA: Low-rank adaptation of large language models
Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. 5, 16
work page 2022
-
[30]
Instantx flux.1-dev ip-adapter page.https:// huggingface.co/InstantX/FLUX.1- dev- IP- Adapter, 2024
InstantX. Instantx flux.1-dev ip-adapter page.https:// huggingface.co/InstantX/FLUX.1- dev- IP- Adapter, 2024. 16
work page 2024
-
[31]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2013. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[32]
Imagenet classification with deep convolutional neural net- works.NeurIPS, 2012
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural net- works.NeurIPS, 2012. 3
work page 2012
-
[33]
Multi-concept customization of text-to-image diffusion
Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InCVPR, 2023. 3
work page 2023
-
[34]
Flux.1-dev.https : / / huggingface
Black Forest Labs. Flux.1-dev.https : / / huggingface . co / black - forest - labs / FLUX.1-dev, 2024. 13, 15
work page 2024
-
[35]
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dock- horn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742,
work page internal anchor Pith review Pith/arXiv arXiv
-
[36]
Jia Li, Jinming Su, Changqun Xia, and Yonghong Tian. Distortion-adaptive salient object detection in 360 omnidi- rectional images.IEEE Journal of Selected Topics in Signal Processing, 2019. 13
work page 2019
-
[37]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023. 5, 16
work page 2023
-
[38]
Compositional visual generation with composable diffusion models
Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. InECCV, 2022. 5
work page 2022
-
[39]
Step1X-Edit: A Practical Framework for General Image Editing
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chun- rui Han, et al. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025. 5, 16
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Deep learning face attributes in the wild
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. InICCV, 2015. 7, 8
work page 2015
-
[41]
Decoupled Weight Decay Regularization
I Loshchilov. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 16
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[42]
Visualiz- ing data using t-sne.Journal of machine learning research,
Laurens van der Maaten and Geoffrey Hinton. Visualiz- ing data using t-sne.Journal of machine learning research,
-
[43]
Snap video: Scaled spatiotemporal transformers for text-to-video synthesis
Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. InCVPR, 2024. 3
work page 2024
-
[44]
SDEdit: Guided image synthesis and editing with stochastic differential equa- tions
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia- jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equa- tions. InICLR, 2022. 3
work page 2022
-
[45]
Gpt-4o.https://openai.com/index/ hello-gpt-4o/, 2025
OpenAI. Gpt-4o.https://openai.com/index/ hello-gpt-4o/, 2025. 6, 7, 8, 16, 19
work page 2025
-
[46]
Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...
work page 2024
-
[47]
Semantic segmentation of outdoor panoramic images.Signal, Image and Video Pro- cessing, 2021
Semih Orhan and Yalin Bastanlar. Semantic segmentation of outdoor panoramic images.Signal, Image and Video Pro- cessing, 2021. 13
work page 2021
-
[48]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023. 3, 16
work page 2023
-
[49]
Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned bench- mark for personalized image generation.arXiv preprint arXiv:2406.16855, 2024. 7
-
[50]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 13
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
Omni-id: Holistic identity represen- tation designed for generative tasks
Guocheng Qian, Kuan-Chieh Wang, Or Patashnik, Negin Heravi, Daniil Ostashev, Sergey Tulyakov, Daniel Cohen- Or, and Kfir Aberman. Omni-id: Holistic identity represen- tation designed for generative tasks. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8786–8795, 2025. 3
work page 2025
-
[52]
Composeme: Attribute-specific image prompts for controllable human image generation
Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, and Kfir Aberman. Composeme: Attribute-specific image prompts for controllable human image generation. arXiv preprint arXiv:2509.18092, 2025
-
[53]
Guocheng Gordon Qian, Ruihang Zhang, Tsai-Shien Chen, Yusuf Dalva, Anujraaj Argo Goyal, Willi Menapace, Ivan Skorokhodov, Meng Dong, Arpit Sahni, Daniil Osta- shev, et al. Layercomposer: Interactive personalized t2i via spatially-aware layered canvas.arXiv preprint arXiv:2510.20820, 2025. 3
-
[54]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, 2021. 2, 3, 5, 6, 7, 8, 17, 19
work page 2021
-
[55]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InCVPR, 2023. 3
work page 2023
-
[56]
Disentan- gling visual embeddings for attributes and objects
Nirat Saini, Khoi Pham, and Abhinav Shrivastava. Disentan- gling visual embeddings for attributes and objects. InCVPR,
-
[57]
Instant- booth: Personalized text-to-image generation without test- time finetuning
Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instant- booth: Personalized text-to-image generation without test- time finetuning. InCVPR, 2024. 3
work page 2024
-
[58]
Lora vs full fine-tuning: An illusion of equivalence.arXiv preprint arXiv:2410.21228, 2024
Reece Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. Lora vs full fine-tuning: An illusion of equivalence.arXiv preprint arXiv:2410.21228, 2024. 5, 9
-
[59]
Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
Vincent Sitzmann, Ana Serrano, Amy Pavel, Maneesh Agrawala, Diego Gutierrez, Belen Masia, and Gordon Wet- zstein. Saliency in vr: How do people explore virtual envi- ronments?IEEE transactions on visualization and computer graphics, 2018. 13
work page 2018
-
[61]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InICML, 2015. 3
work page 2015
-
[62]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 3
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[63]
Plug-and-play diffusion features for text-driven image-to-image translation
Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. InCVPR, 2023. 3
work page 2023
- [64]
-
[65]
Concept decomposition for visual exploration and inspiration.ACM TOG, 2023
Yael Vinker, Andrey V oynov, Daniel Cohen-Or, and Ariel Shamir. Concept decomposition for visual exploration and inspiration.ACM TOG, 2023. 3 11
work page 2023
-
[66]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 3, 4, 5, 6, 7, 13, 14, 16, 17
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[67]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, 2022. 4, 14
work page 2022
-
[68]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 3, 5, 6, 7, 13, 17
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[69]
Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, et al. Dreamomni2: Multimodal instruction-based editing and generation.arXiv preprint arXiv:2510.06679, 2025. 7
-
[70]
Omnigen: Unified image genera- tion
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xin- grun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image genera- tion. InCVPR, 2025. 3, 5, 6, 7, 17
work page 2025
-
[71]
Zhichao Yang, Leida Li, Pengfei Chen, Jinjian Wu, and Giuseppe Valenzise. Language-guided visual perception dis- entanglement for image quality assessment and conditional image generation.arXiv preprint arXiv:2503.02206, 2025. 3
-
[72]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models.arXiv preprint arXiv:2308.06721,
work page internal anchor Pith review Pith/arXiv arXiv
-
[73]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023. 13
work page 2023
-
[74]
A fixation-based 360 benchmark dataset for salient object detection
Yi Zhang, Lu Zhang, Wassim Hamidouche, and Olivier De- forges. A fixation-based 360 benchmark dataset for salient object detection. InICIP, 2020. 13
work page 2020
-
[75]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien- Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experi- ences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277, 2023. 16
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[76]
Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, and Guanbin Li. Mod-adapter: Tuning-free and versatile multi-concept personalization via modulation adapter.arXiv preprint arXiv:2505.18612, 2025. 3 12 Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization Supplementary Material A. Implementatio...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.