Recognition: 2 Lean theorem links
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
Pith reviewed 2026-05-14 21:52 UTC · model grok-4.3
The pith
Fashion130K dataset and UMC framework align text and image prompts to generate more consistent outfits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present Fashion130K as a comprehensive dataset and the UMC framework, where an embedding refiner extracts unified embeddings from multi-modal prompts and a Fusion Transformer aligns text and image embeddings by closing the modality gap. The generation model's attention is redesigned to let the noise image select pivotal tokens from these unified prompts, resulting in more consistent garment generation compared to state-of-the-art methods on both the dataset and real-world applications.
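The token-selection step is described only at a high level. As a rough sketch of the idea — all names and values below are illustrative assumptions, not the authors' implementation — a top-k attention that lets each noise-image query keep only its highest-scoring prompt tokens could look like:

```python
import math

def top_k_attention(scores, k):
    """For each noise-image query row, keep only the k highest-scoring
    prompt tokens, softmax over those, and zero out the rest, so the
    noise image attends only to 'pivotal' tokens."""
    out = []
    for row in scores:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        exps = {j: math.exp(row[j]) for j in keep}
        total = sum(exps.values())
        out.append([exps[j] / total if j in exps else 0.0
                    for j in range(len(row))])
    return out

# One query scored against four unified prompt tokens; k=2 keeps
# only the two strongest correlations and renormalizes them.
weights = top_k_attention([[2.0, 0.1, 1.5, -0.3]], k=2)
```

In a real model the scores would be query-key dot products over batches of latents, but the selection-then-renormalize pattern is the same.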
What carries the argument
Unified Multi-modal Condition (UMC) with embedding refiner and Fusion Transformer, which aligns text and image embeddings to enable consistent outfit generation.
If this is right
- Generation models produce outfits with higher visual consistency when using aligned multi-modal embeddings.
- E-commerce tools gain improved results for design tasks that combine text descriptions and reference photos.
- Redesigned attention lets noise images prioritize key tokens from prompts during the generation process.
- Large datasets like Fashion130K support detailed testing of multi-modal conditions beyond single-modality setups.
Where Pith is reading between the lines
- The alignment approach could extend to other domains like furniture or product visualization where text and images are mixed.
- If the gap closure works without information loss, separate text-only or image-only models may become less necessary for fashion tasks.
- Real-world testing with unusual garment combinations would show whether the consistency gains hold for edge cases.
Load-bearing premise
The Fusion Transformer can reliably close the modality gap between text and image embeddings while preserving the information needed for consistent garment generation.
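That premise is at least measurable. Following the centroid-distance definition of the modality gap used in the contrastive-learning literature the paper draws on, a minimal sketch — the two-dimensional embeddings here are made up purely for illustration — is:

```python
def modality_gap(text_embs, image_embs):
    """Euclidean distance between the centroids of text and image
    embeddings: one standard way to quantify the modality gap."""
    def centroid(vecs):
        return [sum(v[i] for v in vecs) / len(vecs)
                for i in range(len(vecs[0]))]
    a, b = centroid(text_embs), centroid(image_embs)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# An aligner that closes the gap should shrink this distance for
# paired prompts without collapsing garment-specific detail.
gap = modality_gap([[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0], [0.2, 0.8]])
```

Tracking this number before and after the Fusion Transformer, alongside a reconstruction or retrieval metric, would separate "gap closed" from "information lost".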
What would settle it
Generate outfits from paired text and image prompts that specify conflicting details such as color or style, then check whether UMC outputs match both inputs more accurately than baseline methods across a held-out test set.
Original abstract
Recent research work on fashion outfit generation focuses on promoting visual consistency of garments by leveraging key information from reference image and text prompt. However, the potential of outfit generation remains underexplored, requiring comprehensive e-commercial dataset and elaborative utilization of multi-modal condition. In this paper, we propose a brand-new e-commerce dataset, named Fashion130k, with various occasions, models, and garment types. For the consistent generation of garment, we design a framework with Unified Multi-modal Condition (UMC) to align and integrate the text and visual prompts into generation model. Specifically, we explore an embedding refiner to extract the unified embeddings of multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by adjusting the modality gap between text and image. Based on unified embeddings, the attention in generation model is redesigned to emphasis the correlations between prompts and noise image, inducing that the noise image can select the pivotal tokens of prompts for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and benchmark demonstrate the effectiveness of UMC in visual consistency, achieving promising result than that of SoTA methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Fashion130K dataset, a large-scale e-commerce fashion collection covering diverse occasions, models, and garment types. It proposes a Unified Multi-modal Condition (UMC) framework that employs an embedding refiner and a Fusion Transformer to align text and image prompt embeddings by reducing the modality gap, followed by a redesigned attention mechanism in the generation model that emphasizes correlations between the prompts and the noise image to enable selection of pivotal tokens for consistent outfit generation. The authors claim that experiments on real-world applications and benchmarks show improved visual consistency over state-of-the-art methods.
Significance. If the central claims are supported by quantitative evidence, the work would supply a valuable new benchmark dataset for fashion generation and a concrete architectural approach to multi-modal prompt alignment that could generalize to other conditional generation tasks. The emphasis on unified embeddings and attention redesign directly targets a known challenge in text-image conditioning.
major comments (2)
- Abstract: The claim of achieving 'promising result than that of SoTA methods' in visual consistency is unsupported by any quantitative metrics, error bars, ablation studies, or dataset statistics, leaving the central empirical claim without verifiable evidence.
- Fusion Transformer description: No ablation isolating the transformer's effect, no pre/post-alignment similarity metrics, and no failure-case analysis are supplied to demonstrate that the redesigned attention reliably closes the text-image embedding gap while preserving garment-specific details needed for consistent generation.
minor comments (2)
- Abstract: Grammatical issue in 'achieving promising result than that of SoTA methods'; revise to 'achieving more promising results than SoTA methods'.
- Abstract: The phrase 'brand-new' is informal for a journal submission; consider 'new' or 'novel'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the empirical claims and the need for additional analysis of the Fusion Transformer. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: Abstract: The claim of achieving 'promising result than that of SoTA methods' in visual consistency is unsupported by any quantitative metrics, error bars, ablation studies, or dataset statistics, leaving the central empirical claim without verifiable evidence.
Authors: We agree the abstract should be self-contained with quantitative support. The full manuscript reports results in Section 4 using FID, LPIPS, and user-study scores demonstrating improvements over SoTA baselines, along with dataset statistics in Section 3. In revision we will update the abstract to cite specific metrics (e.g., “reducing FID by 0.12 with 95% confidence intervals from five runs”) while retaining the high-level claim. revision: yes
-
Referee: Fusion Transformer description: No ablation isolating the transformer's effect, no pre/post-alignment similarity metrics, and no failure-case analysis are supplied to demonstrate that the redesigned attention reliably closes the text-image embedding gap while preserving garment-specific details needed for consistent generation.
Authors: We acknowledge the value of isolating the component’s contribution. We will add an ablation table removing the Fusion Transformer, report cosine-similarity scores between text and image embeddings before and after alignment, and include a qualitative failure-case subsection with examples showing both successful detail preservation and remaining limitations of the attention redesign. revision: yes
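The pre/post-alignment measurement promised here is easy to pin down. A minimal sketch of the metric alone — the pairing and any data are assumptions, not the authors' protocol:

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def mean_pair_similarity(text_embs, image_embs):
    """Average cosine similarity over paired text/image embeddings,
    reported once before and once after alignment to show the
    modality gap closing."""
    sims = [cosine(t, i) for t, i in zip(text_embs, image_embs)]
    return sum(sims) / len(sims)
```

Reporting this average for the same pairs before and after the Fusion Transformer is the simplest version of the requested evidence.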
Circularity Check
No significant circularity in dataset proposal or UMC framework design
Full rationale
The paper introduces a new e-commerce dataset (Fashion130K) and a UMC framework consisting of an embedding refiner plus Fusion Transformer for multi-modal alignment. No mathematical derivations, equations, or predictions appear that reduce to fitted parameters, self-definitions, or self-citation chains. Claims of improved visual consistency rest on empirical experiments and comparisons to SoTA methods rather than any internal reduction to the inputs by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
we propose the Fusion Transformer which learns independent representation before modality interaction and subsequently merges the separated representations into unified embedding by shared attention and MLP layers... Masked Self Attention... top-k attention
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Extensive experiments on real-world applications and benchmark demonstrate the effectiveness of UMC in visual consistency
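The first quoted passage ("learns independent representation before modality interaction... Masked Self Attention") suggests a block-diagonal attention mask that keeps the two modalities separate before fusion. The paper publishes no code, so the following is only a guessed sketch of that masking pattern:

```python
def modality_mask(n_text, n_image):
    """Block-diagonal self-attention mask: True means 'may attend'.
    Text tokens attend only to text tokens and image tokens only to
    image tokens, so each modality's representation stays independent
    until shared attention and MLP layers merge them."""
    n = n_text + n_image
    return [[(i < n_text) == (j < n_text) for j in range(n)]
            for i in range(n)]

mask = modality_mask(2, 3)
# Tokens 0-1 (text) see only each other; tokens 2-4 (image) likewise.
```

Lifting this mask after the independent-representation stage would correspond to the "modality interaction" step the passage describes.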
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Imagen 3
Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3. arXiv preprint arXiv:2408.07009, 2024.
-
[2]
Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, pages arXiv–2506.
-
[3]
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
-
[4]
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
-
[5]
Magic clothing: Controllable garment-driven image synthesis
Weifeng Chen, Tao Gu, Yuhao Xu, and Arlene Chen. Magic clothing: Controllable garment-driven image synthesis. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 6939–6948, 2024.
-
[6]
Viton-hd: High-resolution virtual try-on via misalignment-aware normalization
Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. VITON-HD: High-resolution virtual try-on via misalignment-aware normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14131–14140, 2021.
-
[7]
Improving diffusion models for virtual try-on
Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for virtual try-on. arXiv e-prints, pages arXiv–2403, 2024.
-
[8]
Towards multi-pose guided virtual try-on network
Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin. Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9026–9035, 2019.
-
[9]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
-
[10]
Generative adversarial nets
Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
-
[11]
Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks
Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Jiaming Liu, and Chuang Zhang. Any2AnyTryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19085–19096, 2025.
-
[12]
Zinan Guo, Yanze Wu, Chen Zhuowei, Peng Zhang, Qian He, et al. PuLID: Pure and lightning ID customization via contrastive alignment. Advances in Neural Information Processing Systems, 37:36777–36804, 2024.
-
[13]
Viton: An image-based virtual try-on network
Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. VITON: An image-based virtual try-on network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7543–7552, 2018.
-
[14]
Emma: Your text-to-image diffusion model can secretly accept multi-modal prompts
Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, and Hanwang Zhang. EMMA: Your text-to-image diffusion model can secretly accept multi-modal prompts. arXiv preprint arXiv:2406.09162, 2024.
-
[15]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
-
[16]
An introduction to flow matching and diffusion models
Peter Holderrieth and Ezra Erives. An introduction to flow matching and diffusion models. arXiv preprint arXiv:2506.02070, 2025.
-
[17]
Fashionon: Semantic-guided image-based virtual try-on with detailed human and clothing information
Chia-Wei Hsieh, Chieh-Yun Chen, Chien-Lung Chou, Hong-Han Shuai, Jiaying Liu, and Wen-Huang Cheng. FashionOn: Semantic-guided image-based virtual try-on with detailed human and clothing information. In Proceedings of the 27th ACM International Conference on Multimedia, pages 275–283, 2019.
-
[18]
LoRA: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
-
[19]
Team K. Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis. arXiv preprint, 2024.
-
[20]
A style-based generator architecture for generative adversarial networks
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
-
[21]
Klemen Kotar, Stephen Tian, Hong-Xing Yu, Dan Yamins, and Jiajun Wu. Are these the same apple? Comparing images based on object intrinsics. Advances in Neural Information Processing Systems, 36:40853–40871, 2023.
-
[22]
Kathleen M Lewis, Srivatsan Varadharajan, and Ira Kemelmacher-Shlizerman. TryOnGAN: Body-aware try-on via layered interpolation. ACM Transactions on Graphics (TOG), 40(4):1–10, 2021.
-
[23]
Toward accurate and realistic outfits visualization with attention to details
Kedan Li, Min Jin Chong, Jeffrey Zhang, and Jingen Liu. Toward accurate and realistic outfits visualization with attention to details. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15546–15555, 2021.
-
[24]
Yuhan Li, Hao Zhou, Wenxiang Shang, Ran Lin, Xuanhong Chen, and Bingbing Ni. AnyFit: Controllable virtual try-on for any combination of attire across any scenario. arXiv preprint arXiv:2405.18172, 2024.
-
[25]
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained Chinese understanding. arXiv preprint arXiv:2405.08748, 2024.
-
[26]
Dual diffusion for unified image generation and understanding
Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2779–2790, 2025.
-
[27]
Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
-
[28]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
-
[29]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
-
[30]
Deepfashion: Powering robust clothes recognition and retrieval with rich annotations
Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1096–1104, 2016.
-
[31]
Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. ACE++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487, 2025.
-
[32]
Dress code: High-resolution multi-category virtual try-on
Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress code: High-resolution multi-category virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2231–2235, 2022.
-
[33]
DreamO: A unified framework for image customization
Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. DreamO: A unified framework for image customization. arXiv preprint arXiv:2504.16915.
-
[34]
Image based virtual try-on network from unpaired data
Assaf Neuberger, Eran Borenstein, Bar Hilleli, Eduard Oks, and Sharon Alpert. Image based virtual try-on network from unpaired data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5184–5193, 2020.
-
[35]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
-
[36]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205.
-
[37]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
-
[38]
Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-Image 2.0: A unified and efficient image generative framework. arXiv preprint arXiv:2503.21758, 2025.
-
[39]
Accept the modality gap: An exploration in the hyperbolic space
Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, and Ajanthan Thalaiyasingam. Accept the modality gap: An exploration in the hyperbolic space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27263–27272, 2024.
-
[40]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
-
[41]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
-
[42]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
-
[43]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
-
[44]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
-
[45]
Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. OminiControl: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098, 2024.
-
[46]
Mv-vton: Multi-view virtual try-on with diffusion models
Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, and Wangmeng Zuo. MV-VTON: Multi-view virtual try-on with diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7682–7690, 2025.
-
[47]
Rui Wang, Hailong Guo, Jiaming Liu, Huaxia Li, Haibo Zhao, Xu Tang, Yao Hu, Hao Tang, and Peipei Li. StableGarment: Garment-centric generation via stable diffusion. arXiv preprint arXiv:2403.10783, 2024.
-
[48]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.
-
[49]
OmniGen2: Towards Instruction-Aligned Multimodal Generation
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025.
-
[50]
Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, and Qian He. USO: Unified style and subject-driven generation via disentangled and reward learning. arXiv preprint arXiv:2508.18966, 2025.
-
[51]
Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, and Qian He. Less-to-more generalization: Unlocking more controllability by in-context generation. arXiv preprint arXiv:2504.02160, 2025.
-
[52]
OmniGen: Unified image generation
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025.
-
[53]
OOTDiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on
Yuhao Xu, Tao Gu, Weifeng Chen, and Arlene Chen. OOTDiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8996–9004, 2025.
-
[54]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
-
[55]
Generating high-resolution fashion model images wearing custom outfits
Gokhan Yildirim, Nikolay Jetchev, Roland Vollgraf, and Urs Bergmann. Generating high-resolution fashion model images wearing custom outfits. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
-
[56]
Donggeun Yoo, Namil Kim, Sunggyun Park, Anthony S Paek, and In So Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, pages 517–532. Springer, 2016.
-
[57]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
-
[58]
Virtually trying on new clothing with arbitrary poses
Na Zheng, Xuemeng Song, Zhaozheng Chen, Linmei Hu, Da Cao, and Liqiang Nie. Virtually trying on new clothing with arbitrary poses. In Proceedings of the 27th ACM International Conference on Multimedia, pages 266–274, 2019.
-
[59]
Learning flow fields in attention for controllable person image generation
Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, et al. Learning flow fields in attention for controllable person image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2491–2501, 2025.
discussion (0)