pith. machine review for the scientific record.

arxiv: 2605.10127 · v2 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords: Fashion130K · outfit generation · multi-modal prompts · Fusion Transformer · visual consistency · e-commerce dataset · generative models · prompt alignment

The pith

Fashion130K dataset and UMC framework align text and image prompts to generate more consistent outfits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Fashion130K, a new e-commerce dataset of 130,000 fashion images spanning various occasions, models, and garment types. It proposes the Unified Multi-modal Condition (UMC) framework, which combines text prompts with reference images for outfit generation. An embedding refiner extracts unified embeddings, within which a Fusion Transformer aligns the text and image modalities by narrowing the gap between them. The generation model's attention is then redesigned so the noise image focuses on the pivotal prompt tokens. This setup targets the common problem of visual mismatches when mixing modalities, offering a practical path to more reliable clothing generation for design and retail tools.

Core claim

The authors present Fashion130K as a comprehensive dataset and the UMC framework, where an embedding refiner extracts unified embeddings from multi-modal prompts and a Fusion Transformer aligns text and image embeddings by closing the modality gap. The generation model's attention is redesigned to let the noise image select pivotal tokens from these unified prompts, resulting in more consistent garment generation compared to state-of-the-art methods on both the dataset and real-world applications.

What carries the argument

Unified Multi-modal Condition (UMC) with embedding refiner and Fusion Transformer, which aligns text and image embeddings to enable consistent outfit generation.
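To make the machinery concrete, here is a minimal PyTorch sketch of an embedding refiner feeding a Fusion Transformer. Every module name, dimension, and the learned per-modality offsets are illustrative assumptions rather than the authors' implementation; the point is only the data flow: project both modalities to a shared width, then let a joint transformer pull them toward a common region.

    import torch
    import torch.nn as nn

    class FusionTransformer(nn.Module):
        """Jointly attends over text and image tokens to narrow the modality gap."""
        def __init__(self, dim: int = 768, depth: int = 4, heads: int = 8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True, norm_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            # Learned offsets shift each modality toward a shared region (assumed).
            self.text_offset = nn.Parameter(torch.zeros(1, 1, dim))
            self.image_offset = nn.Parameter(torch.zeros(1, 1, dim))

        def forward(self, text_emb, image_emb):
            tokens = torch.cat([text_emb + self.text_offset,
                                image_emb + self.image_offset], dim=1)
            return self.encoder(tokens)  # unified embeddings, (B, T_t + T_i, dim)

    class EmbeddingRefiner(nn.Module):
        """Projects raw encoder outputs to a common width, then fuses them."""
        def __init__(self, text_dim: int = 768, image_dim: int = 1024, dim: int = 768):
            super().__init__()
            self.text_proj = nn.Linear(text_dim, dim)
            self.image_proj = nn.Linear(image_dim, dim)
            self.fusion = FusionTransformer(dim=dim)

        def forward(self, text_emb, image_emb):
            return self.fusion(self.text_proj(text_emb), self.image_proj(image_emb))

    # Example: fuse 77 text tokens with 256 image-patch tokens.
    refiner = EmbeddingRefiner()
    unified = refiner(torch.randn(2, 77, 768), torch.randn(2, 256, 1024))
    print(unified.shape)  # torch.Size([2, 333, 768])

The unified sequence is what the generation model's redesigned attention would then consume.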

If this is right

  • Generation models produce outfits with higher visual consistency when using aligned multi-modal embeddings.
  • E-commerce tools gain improved results for design tasks that combine text descriptions and reference photos.
  • Redesigned attention lets the noise image prioritize key prompt tokens during generation (see the sketch after this list).
  • Large datasets like Fashion130K support detailed testing of multi-modal conditions beyond single-modality setups.
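The attention point is concrete enough to sketch. Figure 6 visualizes top-k (k = 8) attention between noise tokens and condition tokens; below, each noise token keeps only its k highest-scoring condition tokens before the softmax. The shapes and the plain dot-product form are assumptions for illustration, not the paper's exact attention redesign.

    import torch
    import torch.nn.functional as F

    def topk_cross_attention(noise: torch.Tensor, cond: torch.Tensor, k: int = 8) -> torch.Tensor:
        """noise: (B, N, d) latent tokens; cond: (B, M, d) unified prompt tokens."""
        d = noise.shape[-1]
        scores = noise @ cond.transpose(-1, -2) / d ** 0.5   # (B, N, M)
        kth = scores.topk(k, dim=-1).values[..., -1:]        # k-th best score per query
        masked = scores.masked_fill(scores < kth, float("-inf"))
        attn = F.softmax(masked, dim=-1)                     # zero weight off the top-k
        return attn @ cond                                   # (B, N, d)

    out = topk_cross_attention(torch.randn(2, 64, 512), torch.randn(2, 333, 512))
    print(out.shape)  # torch.Size([2, 64, 512])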

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The alignment approach could extend to other domains like furniture or product visualization where text and images are mixed.
  • If the gap closure works without information loss, separate text-only or image-only models may become less necessary for fashion tasks.
  • Real-world testing with unusual garment combinations would show whether the consistency gains hold for edge cases.

Load-bearing premise

The Fusion Transformer can reliably close the modality gap between text and image embeddings while preserving the information needed for consistent garment generation.
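One way to make this premise testable: Liang et al. [27] summarize the modality gap as the distance between the centroids of normalized text and image embeddings. A minimal sketch, assuming pooled per-prompt embeddings are available before and after the Fusion Transformer; the premise predicts the gap shrinks while downstream consistency does not degrade.

    import torch
    import torch.nn.functional as F

    def modality_gap(text_emb: torch.Tensor, image_emb: torch.Tensor) -> float:
        """text_emb, image_emb: (N, d) pooled embeddings for N paired prompts."""
        t = F.normalize(text_emb, dim=-1).mean(dim=0)  # text centroid on unit sphere
        i = F.normalize(image_emb, dim=-1).mean(dim=0)  # image centroid
        return (t - i).norm().item()

    # Report modality_gap(...) on raw encoder outputs and again on the
    # aligned embeddings; alignment should drive the number toward zero.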

What would settle it

Generate outfits from paired text and image prompts that specify conflicting details such as color or style, then check whether UMC outputs match both inputs more accurately than baseline methods across a held-out test set.
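A hedged harness for that test: score every generated outfit against both of its conditions with an off-the-shelf CLIP checkpoint and compare methods. The checkpoint and the cosine scoring rule are assumptions for illustration, not the paper's protocol.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    @torch.no_grad()
    def dual_consistency(generated: Image.Image, prompt: str, reference: Image.Image):
        """Cosine similarity of the generated image to the text condition
        and to the reference-image condition."""
        img = model.get_image_features(
            **processor(images=[generated, reference], return_tensors="pt"))
        txt = model.get_text_features(
            **processor(text=[prompt], return_tensors="pt", padding=True))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        gen, ref = img[0], img[1]
        return (gen @ txt[0]).item(), (gen @ ref).item()

    # Run over a held-out set of deliberately conflicting pairs, e.g. a
    # "red blazer" caption with a blue-blazer reference photo; the claim
    # holds if UMC beats the baselines on both scores simultaneously.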

Figures

Figures reproduced from arXiv: 2605.10127 by Jingling Fu, Junshi Huang, Lichen Ma, Ting Zhu, Xinyuan Shan, Yan Li, Yichun Liu, Yu He, Yu Shi.

Figure 1. Our UMC model enables fashion outfit generation with … (figures/full_fig_p001_1.png)
Figure 2. Category distribution and visual examples from the … (figures/full_fig_p003_2.png)
Figure 3. Overview of the UMC framework, which integrates a multi-modal Embedding Refiner and Selective Attention. We explore … (figures/full_fig_p005_3.png)
Figure 4. Impact of different architectural choices on validation … (figures/full_fig_p005_4.png)
Figure 5. Qualitative results compared with various methods for fashion outfit generation on the test set of Fashion130K. (figures/full_fig_p007_5.png)
Figure 6. Visualization of Top-k (k = 8) attention between noise tokens and multi-modal condition tokens. Compared methods: evaluation on the test set of Fashion130K, which contains 3000 garment-model pairs with rich captions, covers three categories of controllable image generation methods: fashion outfit methods (Magic Clothing [5], Any2AnyTryon [11]); subject-driven methods (OmniContr…)
read the original abstract

Recent research work on fashion outfit generation focuses on promoting visual consistency of garments by leveraging key information from reference image and text prompt. However, the potential of outfit generation remains underexplored, requiring comprehensive e-commercial dataset and elaborative utilization of multi-modal condition. In this paper, we propose a brand-new e-commerce dataset, named Fashion130k, with various occasions, models, and garment types. For the consistent generation of garment, we design a framework with Unified Multi-modal Condition (UMC) to align and integrate the text and visual prompts into generation model. Specifically, we explore an embedding refiner to extract the unified embeddings of multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by adjusting the modality gap between text and image. Based on unified embeddings, the attention in generation model is redesigned to emphasis the correlations between prompts and noise image, inducing that the noise image can select the pivotal tokens of prompts for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and benchmark demonstrate the effectiveness of UMC in visual consistency, achieving promising result than that of SoTA methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Fashion130K dataset, a large-scale e-commerce fashion collection covering diverse occasions, models, and garment types. It proposes a Unified Multi-modal Condition (UMC) framework that employs an embedding refiner and a Fusion Transformer to align text and image prompt embeddings by reducing the modality gap, followed by a redesigned attention mechanism in the generation model that emphasizes correlations between the prompts and the noise image to enable selection of pivotal tokens for consistent outfit generation. The authors claim that experiments on real-world applications and benchmarks show improved visual consistency over state-of-the-art methods.

Significance. If the central claims are supported by quantitative evidence, the work would supply a valuable new benchmark dataset for fashion generation and a concrete architectural approach to multi-modal prompt alignment that could generalize to other conditional generation tasks. The emphasis on unified embeddings and attention redesign directly targets a known challenge in text-image conditioning.

major comments (2)
  1. Abstract: The claim of achieving 'promising result than that of SoTA methods' in visual consistency is unsupported by any quantitative metrics, error bars, ablation studies, or dataset statistics, leaving the central empirical claim without verifiable evidence.
  2. Fusion Transformer description: No ablation isolating the transformer's effect, no pre/post-alignment similarity metrics, and no failure-case analysis are supplied to demonstrate that the redesigned attention reliably closes the text-image embedding gap while preserving garment-specific details needed for consistent generation.
minor comments (2)
  1. Abstract: Grammatical issue in 'achieving promising result than that of SoTA methods'; revise to 'achieving more promising results than SoTA methods'.
  2. Abstract: The phrase 'brand-new' is informal for a journal submission; consider 'new' or 'novel'.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical claims and the need for additional analysis of the Fusion Transformer. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: Abstract: The claim of achieving 'promising result than that of SoTA methods' in visual consistency is unsupported by any quantitative metrics, error bars, ablation studies, or dataset statistics, leaving the central empirical claim without verifiable evidence.

    Authors: We agree the abstract should be self-contained with quantitative support. The full manuscript reports results in Section 4 using FID, LPIPS, and user-study scores demonstrating improvements over SoTA baselines, along with dataset statistics in Section 3. In revision we will update the abstract to cite specific metrics (e.g., “reducing FID by 0.12 with 95% confidence intervals from five runs”) while retaining the high-level claim (a sketch of these metric computations follows the point-by-point responses). revision: yes

  2. Referee: Fusion Transformer description: No ablation isolating the transformer's effect, no pre/post-alignment similarity metrics, and no failure-case analysis are supplied to demonstrate that the redesigned attention reliably closes the text-image embedding gap while preserving garment-specific details needed for consistent generation.

    Authors: We acknowledge the value of isolating the component’s contribution. We will add an ablation table removing the Fusion Transformer, report cosine-similarity scores between text and image embeddings before and after alignment, and include a qualitative failure-case subsection with examples showing both successful detail preservation and remaining limitations of the attention redesign. revision: yes
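For context, the metrics invoked in both responses are standard and scriptable. A minimal sketch using torchmetrics (with its image extras installed) for FID and LPIPS, plus plain cosine similarity for the pre/post-alignment report; random tensors stand in for real data, and none of this reproduces the authors' pipeline or numbers.

    import torch
    import torch.nn.functional as F
    from torchmetrics.image.fid import FrechetInceptionDistance
    from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

    real = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)  # test images
    fake = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)  # generations

    fid = FrechetInceptionDistance(feature=64)  # feature=2048 in serious evaluations
    fid.update(real, real=True)
    fid.update(fake, real=False)
    print("FID:", fid.compute().item())

    lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
    print("LPIPS:", lpips(real[:8].float() / 255, fake[:8].float() / 255).item())

    # Pre/post-alignment similarity (response 2): mean cosine between paired
    # text and image embeddings, reported before and after the Fusion Transformer.
    def mean_cosine(text_emb: torch.Tensor, image_emb: torch.Tensor) -> float:
        return F.cosine_similarity(text_emb, image_emb, dim=-1).mean().item()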

Circularity Check

0 steps flagged

No significant circularity in dataset proposal or UMC framework design

full rationale

The paper introduces a new e-commerce dataset (Fashion130K) and a UMC framework consisting of an embedding refiner plus Fusion Transformer for multi-modal alignment. No mathematical derivations, equations, or predictions appear that reduce to fitted parameters, self-definitions, or self-citation chains. Claims of improved visual consistency rest on empirical experiments and comparisons to SoTA methods rather than any internal reduction to the inputs by construction. The work is therefore validated against external benchmarks rather than its own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the effectiveness of the proposed embedding refiner and Fusion Transformer for modality alignment, but the abstract provides no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5539 in / 1122 out tokens · 35918 ms · 2026-05-14T21:52:57.867053+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 12 internal anchors

  1. [1] Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, et al. Imagen 3. arXiv preprint arXiv:2408.07009, 2024.
  2. [2] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, pages arXiv–2506.
  3. [3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  4. [4] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
  5. [5] Weifeng Chen, Tao Gu, Yuhao Xu, and Arlene Chen. Magic Clothing: Controllable garment-driven image synthesis. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 6939–6948, 2024.
  6. [6] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. VITON-HD: High-resolution virtual try-on via misalignment-aware normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14131–14140, 2021.
  7. [7] Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for virtual try-on. arXiv e-prints, pages arXiv–2403, 2024.
  8. [8] Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin. Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9026–9035, 2019.
  9. [9] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
  10. [10] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
  11. [11] Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Jiaming Liu, and Chuang Zhang. Any2AnyTryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19085–19096, 2025.
  12. [12] Zinan Guo, Yanze Wu, Chen Zhuowei, Peng Zhang, Qian He, et al. PuLID: Pure and lightning ID customization via contrastive alignment. Advances in Neural Information Processing Systems, 37:36777–36804, 2024.
  13. [13] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. VITON: An image-based virtual try-on network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7543–7552, 2018.
  14. [14] Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, and Hanwang Zhang. EMMA: Your text-to-image diffusion model can secretly accept multi-modal prompts. arXiv preprint arXiv:2406.09162, 2024.
  15. [15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  16. [16] Peter Holderrieth and Ezra Erives. An introduction to flow matching and diffusion models. arXiv preprint arXiv:2506.02070, 2025.
  17. [17] Chia-Wei Hsieh, Chieh-Yun Chen, Chien-Lung Chou, Hong-Han Shuai, Jiaying Liu, and Wen-Huang Cheng. FashionOn: Semantic-guided image-based virtual try-on with detailed human and clothing information. In Proceedings of the 27th ACM International Conference on Multimedia, pages 275–283, 2019.
  18. [18] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
  19. [19] Team K. Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis. arXiv preprint, 2024.
  20. [20] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  21. [21] Klemen Kotar, Stephen Tian, Hong-Xing Yu, Dan Yamins, and Jiajun Wu. Are these the same apple? Comparing images based on object intrinsics. Advances in Neural Information Processing Systems, 36:40853–40871, 2023.
  22. [22] Kathleen M Lewis, Srivatsan Varadharajan, and Ira Kemelmacher-Shlizerman. TryOnGAN: Body-aware try-on via layered interpolation. ACM Transactions on Graphics (TOG), 40(4):1–10, 2021.
  23. [23] Kedan Li, Min Jin Chong, Jeffrey Zhang, and Jingen Liu. Toward accurate and realistic outfits visualization with attention to details. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15546–15555, 2021.
  24. [24] Yuhan Li, Hao Zhou, Wenxiang Shang, Ran Lin, Xuanhong Chen, and Bingbing Ni. AnyFit: Controllable virtual try-on for any combination of attire across any scenario. arXiv preprint arXiv:2405.18172, 2024.
  25. [25] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained Chinese understanding. arXiv preprint arXiv:2405.08748, 2024.
  26. [26] Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2779–2790, 2025.
  27. [27] Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
  28. [28] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
  29. [29] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
  30. [30] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1096–1104, 2016.
  31. [31] Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. ACE++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487, 2025.
  32. [32] Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress Code: High-resolution multi-category virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2231–2235, 2022.
  33. [33] Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. DreamO: A unified framework for image customization. arXiv preprint arXiv:2504.16915.
  34. [34] Assaf Neuberger, Eran Borenstein, Bar Hilleli, Eduard Oks, and Sharon Alpert. Image based virtual try-on network from unpaired data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5184–5193, 2020.
  35. [35] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  36. [36] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205.
  37. [37] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  38. [38] Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-Image 2.0: A unified and efficient image generative framework. arXiv preprint arXiv:2503.21758, 2025.
  39. [39] Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, and Ajanthan Thalaiyasingam. Accept the modality gap: An exploration in the hyperbolic space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27263–27272, 2024.
  40. [40] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  41. [41] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  42. [42] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  43. [43] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  44. [44] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  45. [45] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. OminiControl: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098, 2024.
  46. [46] Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, and Wangmeng Zuo. MV-VTON: Multi-view virtual try-on with diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7682–7690, 2025.
  47. [47] Rui Wang, Hailong Guo, Jiaming Liu, Huaxia Li, Haibo Zhao, Xu Tang, Yao Hu, Hao Tang, and Peipei Li. StableGarment: Garment-centric generation via stable diffusion. arXiv preprint arXiv:2403.10783, 2024.
  48. [48] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.
  49. [49] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025.
  50. [50] Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, and Qian He. USO: Unified style and subject-driven generation via disentangled and reward learning. arXiv preprint arXiv:2508.18966, 2025.
  51. [51] Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, and Qian He. Less-to-more generalization: Unlocking more controllability by in-context generation. arXiv preprint arXiv:2504.02160, 2025.
  52. [52] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025.
  53. [53] Yuhao Xu, Tao Gu, Weifeng Chen, and Arlene Chen. OOTDiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8996–9004, 2025.
  54. [54] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
  55. [55] Gokhan Yildirim, Nikolay Jetchev, Roland Vollgraf, and Urs Bergmann. Generating high-resolution fashion model images wearing custom outfits. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
  56. [56] Donggeun Yoo, Namil Kim, Sunggyun Park, Anthony S Paek, and In So Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, pages 517–532. Springer, 2016.
  57. [57] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  58. [58] Na Zheng, Xuemeng Song, Zhaozheng Chen, Linmei Hu, Da Cao, and Liqiang Nie. Virtually trying on new clothing with arbitrary poses. In Proceedings of the 27th ACM International Conference on Multimedia, pages 266–274, 2019.
  59. [59] Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, et al. Learning flow fields in attention for controllable person image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2491–2501, 2025.