RefTon: Reference person shot assist virtual Try-on

Bo Cheng; Dawei Leng; Dengyang Jiang; Leibucha Wu; Liuzhuozheng Li; Shanyuan Liu; Yue Gong; Yuhang Ma; Yuhui Yin; Zanyi Wang

arxiv: 2511.00956 · v6 · submitted 2025-11-02 · 💻 cs.CV

Pith reviewed 2026-05-18 01:36 UTC · model grok-4.3

classification 💻 cs.CV

keywords virtual try-on · unpaired references · garment transfer · texture alignment · person-to-person generation · image synthesis · reference-guided editing · flux model

The pith

RefTon generates virtual try-on results directly from source person and target garment images by adding unpaired reference photos for texture refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

RefTon is a virtual try-on framework that takes a source person image and a target garment image and directly outputs the person wearing the garment. It avoids standard preprocessing steps such as body parsing or mask warping by using a Flux-based model with no auxiliary branches for handling inputs. The central addition is a set of unpaired reference images showing the same garment on other people, which supply guidance to align textures and preserve details. A new dataset of these references supports training the system. Experiments on public benchmarks show results that match or exceed those of current leading methods while keeping the overall design simple and efficient.

Core claim

RefTon is a Flux-based person-to-person virtual try-on framework that directly generates try-on results from a source image and a target garment without structural guidance or auxiliary components. It leverages additional unpaired reference images of the target garment worn on different individuals to refine texture alignment and maintain garment details, enabled by a newly built dataset of such references. Extensive experiments on public benchmarks demonstrate competitive or superior performance compared to state-of-the-art methods while preserving a simple and efficient design.

What carries the argument

Unpaired reference images of the target garment on different individuals, used to guide texture alignment and detail preservation inside a direct Flux-based person-to-person generation process.

If this is right

The virtual try-on pipeline simplifies by removing the need for body parsing, warped masks, or separate input-handling branches.
Garment details are preserved more effectively through direct reference-based refinement during generation.
Training can proceed with unpaired data thanks to the constructed reference dataset.
Competitive benchmark performance holds even with the streamlined person-to-person structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The reference-image approach could extend to other conditional image generation tasks that benefit from cross-example consistency cues.
Reduced architectural complexity may support faster inference or easier integration into mobile virtual try-on applications.
Selecting or synthesizing the most useful reference images automatically could become a natural next step for robustness.
The method may lower data requirements for similar garment-transfer problems by relying on unpaired rather than strictly paired examples.

Load-bearing premise

Unpaired reference images of the target garment on different individuals supply reliable and consistent guidance for texture alignment and detail preservation without introducing new artifacts.

What would settle it

Visible texture mismatches, lost garment details, or new artifacts appearing in try-on outputs on patterned clothing or extreme poses when reference images are included would show the guidance is not reliable.

Figures

Figures reproduced from arXiv: 2511.00956 by Bo Cheng, Dawei Leng, Dengyang Jiang, Leibucha Wu, Liuzhuozheng Li, Shanyuan Liu, Yue Gong, Yuhang Ma, Yuhui Yin, Zanyi Wang.

**Figure 2.** The effect of using reference images for the virtual ...

**Figure 3.** The pipeline of our two-stage training strategy: (a) In the first stage, which follows a similar paradigm to ...

**Figure 4.** Adaptation of a three-channel position index: the first channel encodes different conditional inputs, while the ...

**Figure 5.** The overall pipeline of generating the reference images. We first generate the appearance descriptions using ...

**Figure 6.** Qualitative comparison on the VITON dataset.

**Figure 7.** Qualitative comparison on the DressCode dataset.

**Figure 8.** Qualitative results of the ablation study across different settings. “Ref.” denotes that a reference image is provided, while “MF” indicates mask-free inputs using the original person image instead of a masked agnostic image. We conduct an ablation study to examine our model under four settings (w/&w/o mask, w/&w/o Ref.).

**Figure 9.** Text prompts from the Outfit and Action Description Bank. To ensure the model edits only the person while ...

**Figure 10.** Sample reference images generated by our reference data generation pipeline. The editing model takes ...

**Figure 11.** Qualitative paired results in VITON-HD dataset with complex patterns on clothes. “reference” denotes that a reference image is provided.

**Figure 12.** Qualitative paired results in VITON-HD dataset with complex structure on clothes. “reference” denotes that a reference image is provided.

**Figure 13.** Qualitative results of upper-body sub-set in Dresscode dataset unpaired setting. “reference” denotes that a reference image is provided, while “MF” indicates mask-free inputs using the original person image instead of a masked agnostic image.

**Figure 14.** Qualitative results of lower-body sub-set in Dresscode dataset unpaired setting. “reference” denotes that a reference image is provided, while “MF” indicates mask-free inputs using the original person image instead of a masked agnostic image.

**Figure 15.** Qualitative results of dresses sub-set in Dresscode dataset unpaired setting. “reference” denotes that a reference image is provided, while “MF” indicates mask-free inputs using the original person image instead of a masked agnostic image.

Original abstract

We introduce RefTon, a flux-based person-to-person virtual try-on framework that enhances garment realism through unpaired visual references. Unlike conventional approaches that rely on complex auxiliary inputs such as body parsing and warped mask or require finely designed extract branches to process various input conditions, RefTon streamlines the process by directly generating try-on results from a source image and a target garment, without the need for structural guidance or auxiliary components to handle diverse inputs. Moreover, inspired by human clothing selection behavior, RefTon leverages additional reference images (the target garment worn on different individuals) to provide powerful guidance for refining texture alignment and maintaining the garment details. To enable this capability, we built a dataset containing unpaired reference images for training. Extensive experiments on public benchmarks demonstrate that RefTon achieves competitive or superior performance compared to state-of-the-art methods, while maintaining a simple and efficient person-to-person design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RefTon adds unpaired reference shots to a Flux try-on pipeline for simpler garment detail handling, but the abstract gives no metrics or integration details so the gains are hard to judge.

Hi, the main takeaway is that RefTon keeps the try-on pipeline simple by skipping body parsing and warped masks, then adds unpaired reference images of the garment on other people to help with texture and details inside a Flux model. They built a custom dataset for this and say it leads to competitive or better results on benchmarks while staying person-to-person only.

That reference idea is the clearest new piece relative to prior work that leans on structural inputs or extra branches. It draws from how people actually choose clothes, which is a reasonable practical angle for e-commerce use cases. The design choice to avoid auxiliary components is cleaner on paper and could reduce preprocessing steps if it works reliably.

The soft spot is that the abstract states the performance claim without showing numbers, ablations, or error breakdowns, so it is difficult to tell whether the references deliver consistent improvements or just sometimes help. The stress-test point about the unspecified conditioning path is on target here; without knowing if references go through cross-attention, a separate encoder, or inference-only tricks, it is unclear how well they handle pose, lighting, or body-shape differences. If the full paper has those mechanics spelled out plus quantitative tables, that would strengthen the case.

This is mainly for researchers and engineers working on virtual try-on for fashion applications who want a lighter pipeline. A reader already familiar with Flux-based generation would see the incremental tweak and could test the reference trick themselves. I would send it to peer review because the core simplification plus the dataset effort is concrete enough to merit referee time, even if the experiments need tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces RefTon, a Flux-based person-to-person virtual try-on framework that generates results directly from a source person image and target garment image. It avoids complex auxiliary inputs such as body parsing or warped masks and instead incorporates additional unpaired reference images (the target garment worn on different individuals) to refine texture alignment and garment details. A custom dataset of such references was constructed for training, and the abstract states that extensive experiments on public benchmarks show competitive or superior performance to state-of-the-art methods while preserving a simple design.

Significance. If the empirical claims hold and the reference integration proves robust, the work could meaningfully simplify virtual try-on pipelines by reducing dependence on structural guidance and auxiliary branches. The use of unpaired person-shot references to mimic human clothing selection offers a plausible route to better detail preservation, which would be valuable for practical e-commerce applications if the gains are reproducible and artifact-free across pose and lighting variations.

major comments (2)

[Abstract] Abstract: the claim that RefTon 'achieves competitive or superior performance compared to state-of-the-art methods' is unsupported by any quantitative metrics, tables, ablation studies, or error analysis, which is load-bearing for the central empirical contribution.
[Method] Method description (reference-image usage): the conditioning pathway for the unpaired reference images is unspecified (e.g., whether they are encoded in a separate branch, injected through cross-attention, concatenated at the input, or used only at inference), which is critical to verify that they supply consistent texture guidance rather than introducing artifacts when body shape, pose, or illumination differ from the target person.

minor comments (1)

[Abstract] Abstract: the phrase 'person-to-person design' would benefit from a short clarification distinguishing it from standard garment-to-person try-on formulations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and revised the paper to address the concerns regarding empirical support and methodological clarity. Below we provide point-by-point responses.

Referee: [Abstract] Abstract: the claim that RefTon 'achieves competitive or superior performance compared to state-of-the-art methods' is unsupported by any quantitative metrics, tables, ablation studies, or error analysis, which is load-bearing for the central empirical contribution.

Authors: We appreciate this observation. The manuscript describes extensive experiments on public benchmarks and includes qualitative comparisons, but we agree that the abstract claim would be more robust with explicit quantitative backing. In the revised manuscript we have added a dedicated results table reporting standard virtual try-on metrics (FID, LPIPS, SSIM) along with a user study, which supports the statement of competitive or superior performance. We have also updated the abstract to reference these quantitative findings.

Revision: yes

Referee: [Method] Method description (reference-image usage): the conditioning pathway for the unpaired reference images is unspecified (e.g., whether they are encoded in a separate branch, injected through cross-attention, concatenated at the input, or used only at inference), which is critical to verify that they supply consistent texture guidance rather than introducing artifacts when body shape, pose, or illumination differ from the target person.

Authors: We agree that the integration mechanism for the reference images requires explicit description to ensure reproducibility and to clarify robustness under varying conditions. The original submission outlined the overall person-to-person pipeline but did not detail the conditioning pathway. In the revised Section 3 we now specify that each unpaired reference image is passed through the Flux VAE encoder, after which its latent features are injected into the transformer blocks via dedicated cross-attention layers (separate from the garment and person conditioning). This allows the model to selectively attend to texture and detail cues while remaining robust to differences in body shape, pose, and lighting, as further supported by the added ablation studies.

Revision: yes
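
The cross-attention pathway named in the simulated response can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the single-head layout, the residual update, the token counts, and every name in it (`reference_cross_attention`, `w_q`, `w_k`, `w_v`, `random_matrix`) are assumptions made for illustration, and random numbers stand in for VAE latents. It shows only the general mechanism, in which tokens of the image being generated act as queries over tokens derived from a reference image.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def reference_cross_attention(target_tokens, reference_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative, hypothetical layout).

    Queries come from the target tokens; keys and values come from the
    reference tokens. The attended values are added back as a residual.
    """
    q = matmul(target_tokens, w_q)
    k = matmul(reference_tokens, w_k)
    v = matmul(reference_tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    # scores[i][j]: similarity of target token i to reference token j
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)
    updated = [
        [t + a for t, a in zip(t_row, a_row)]
        for t_row, a_row in zip(target_tokens, attended)
    ]
    return updated, weights


def random_matrix(rows, cols, rng, scale=1.0):
    """Gaussian matrix standing in for latents or learned projections."""
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 8  # token width (assumed)
target_tokens = random_matrix(6, d, rng)      # tokens of the image being generated
reference_tokens = random_matrix(4, d, rng)   # tokens from one reference image
w_q, w_k, w_v = (random_matrix(d, d, rng, 1 / math.sqrt(d)) for _ in range(3))

updated, weights = reference_cross_attention(
    target_tokens, reference_tokens, w_q, w_k, w_v
)
print(len(updated), len(updated[0]))   # one updated row per target token
print(len(weights), len(weights[0]))   # one weight per (target, reference) pair
```

Each row of `weights` is a distribution over reference tokens, so a target position can draw on any location in the reference image. That is one way a layer of this kind could tolerate pose differences between the two people, though whether it does so in practice is an empirical question the sketch does not address.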

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or self-referential reductions

The RefTon paper describes a Flux-based virtual try-on architecture that incorporates unpaired reference images for texture guidance and reports competitive performance via experiments on public benchmarks. No equations, first-principles derivations, or mathematical predictions appear in the provided text. Claims rest on architectural choices (direct person-to-person generation without body parsing or auxiliary branches) and a custom dataset, evaluated externally rather than fitted or defined circularly against the target results. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing manner that reduces the central contribution to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework assumes standard capabilities of Flux-based generative models for image synthesis and that reference images can be effectively leveraged for texture guidance; no explicit free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Flux-based diffusion models can directly synthesize realistic try-on images from source person and garment inputs when supplemented with unpaired references.
The entire pipeline is built on this generative modeling premise without additional structural conditioning.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV · 2026-05 · unverdicted · novelty 7.0

D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV · 2026-05 · unverdicted · novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds
cs.LG · 2026-04 · unverdicted · novelty 5.0

Aligning the DDIM forward diffusion process with flow-matching manifold evolution enables high-quality generation without time conditioning, and class-conditional synthesis is possible with an unconditional denoiser b...

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 2 Pith papers · 11 internal anchors

[1]

Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

work page 2020
[2]

Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models.ArXiv, abs/2006.11239, 2020. URLhttps://api.semanticscholar.org/CorpusID:219955663

work page internal anchor Pith review Pith/arXiv arXiv 2006
[3]

Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer

Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021. URLhttps://api.semanticscholar.org/CorpusID:245335280

work page 2022
[4]

Viton-hd: High-resolution virtual try-on via misalignment-aware normalization

Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14131–14140, 2021

work page 2021
[5]

Viton: An image-based virtual try-on network

Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7543–7552, 2018

work page 2018
[6]

Toward characteristic- preserving image-based virtual try-on network

Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward characteristic- preserving image-based virtual try-on network. InProceedings of the European conference on computer vision (ECCV), pages 589–604, 2018

work page 2018
[7]

Wear-any-way: Manipulable virtual try-on via sparse correspondence alignment

Mengting Chen, Xi Chen, Zhonghua Zhai, Chen Ju, Xuewen Hong, Jinsong Lan, and Shuai Xiao. Wear-any-way: Manipulable virtual try-on via sparse correspondence alignment. InEuropean Conference on Computer Vision, pages 124–142. Springer, 2024

work page 2024
[8]

Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on

Yuhao Xu, Tao Gu, Weifeng Chen, and Chengcai Chen. Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. InAAAI Conference on Artificial Intelligence, 2024. URL https://api. semanticscholar.org/CorpusID:268247604

work page 2024
[9]

Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on

Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8176–8185, 2024

work page 2024
[10]

Tryondiffusion: A tale of two unets

Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, and Ira Kemelmacher-Shlizerman. Tryondiffusion: A tale of two unets. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4606–4615, 2023

work page 2023
[11]

Magicanimate: Temporally consistent human image animation using diffusion model

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024

work page 2024
[12]

Improving diffusion models for authentic virtual try-on in the wild

Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for authentic virtual try-on in the wild. InEuropean Conference on Computer Vision, pages 206–235. Springer, 2024

work page 2024
[13]

Promptdresser: Improving the quality and control- lability of virtual try-on via generative textual prompt and prompt-aware mask.arXiv preprint arXiv:2412.16978, 2024

Jeongho Kim, Hoiyeong Jin, Sunghyun Park, and Jaegul Choo. Promptdresser: Improving the quality and control- lability of virtual try-on via generative textual prompt and prompt-aware mask.arXiv preprint arXiv:2412.16978, 2024

work page arXiv 2024
[14]

Catvton: Concatenation is all you need for virtual try-on with diffusion models.arXiv preprint arXiv:2407.15886,

Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Dongmei Jiang, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models.arXiv preprint arXiv:2407.15886, 2024

work page arXiv 2024
[15]

Catv2ton: Taming diffusion transformers for vision-based virtual try-on with temporal concatenation.arXiv preprint arXiv:2501.11325, 2025

Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, and Xiaodan Liang. Catv2ton: Taming diffusion transformers for vision-based virtual try-on with temporal concatenation.arXiv preprint arXiv:2501.11325, 2025

work page arXiv 2025
[16]

Omnivton: Training-free universal virtual try-on.arXiv preprint arXiv:2507.15037, 2025

Zhaotong Yang, Yuhui Li, Shengfeng He, Xinzhe Li, Yangyang Xu, Junyu Dong, and Yong Du. Omnivton: Training-free universal virtual try-on.arXiv preprint arXiv:2507.15037, 2025. 11 RefVTON: person-to-person Try on with Additional Unpaired Visual ReferenceA PREPRINT

work page arXiv 2025
[17]

Densepose: Dense human pose estimation in the wild

Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7297–7306, 2018

work page 2018
[18]

Deeppose: Human pose estimation via deep neural networks

Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1653–1660, 2014

work page 2014
[19]

Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y . A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019

work page 2019
[20]

Realtime multi-person 2d pose estimation using part affinity fields

Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. InCVPR, 2017

work page 2017
[21]

Convolutional pose machines

Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. InCVPR, 2016

work page 2016
[22]

Self-correction for human parsing.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. Self-correction for human parsing.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. doi:10.1109/TPAMI.2020.3048039

work page doi:10.1109/tpami.2020.3048039 2020
[23]

Towards unified human parsing and pose estimation

Jian Dong, Qiang Chen, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. Towards unified human parsing and pose estimation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 843–850, 2014

work page 2014
[24]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

work page 2023
[25]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.007...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Dress code: High-resolution multi-category virtual try-on

Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress code: High-resolution multi-category virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2231–2235, 2022

work page 2022
[27]

Vivid: Video virtual try-on using diffusion models,

Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, and Zheng-Jun Zha. Vivid: Video virtual try-on using diffusion models.arXiv preprint arXiv:2405.11794, 2024

work page arXiv 2024
[28]

Virtually trying on new clothing with arbitrary poses

Na Zheng, Xuemeng Song, Zhaozheng Chen, Linmei Hu, Da Cao, and Liqiang Nie. Virtually trying on new clothing with arbitrary poses. InProceedings of the 27th ACM international conference on multimedia, pages 266–274, 2019

work page 2019
[29]

Imagdressing- v1: Customizable virtual dressing

Fei Shen, Xin Jiang, Xin He, Hu Ye, Cong Wang, Xiaoyu Du, Zechao Li, and Jinghui Tang. Imagdressing- v1: Customizable virtual dressing. InAAAI Conference on Artificial Intelligence, 2024. URL https://api. semanticscholar.org/CorpusID:271244829

work page 2024
[30]

Wildvidfit: Video virtual try-on in the wild via image-based controlled diffusion models.arXiv preprint arXiv:2407.10625, 2024

Zijian He, Peixin Chen, Guangrun Wang, Guanbin Li, Philip HS Torr, and Liang Lin. Wildvidfit: Video virtual try-on in the wild via image-based controlled diffusion models.arXiv preprint arXiv:2407.10625, 2024

work page arXiv 2024
[31]

Hf-vton: High-fidelity virtual try-on via consistent geometric and semantic alignment.arXiv preprint arXiv:2505.19638, 2025

Ming Meng, Qi Dong, Jiajie Li, Zhe Zhu, Xingyu Wang, Zhaoxin Fan, Wei Zhao, and Wenjun Wu. Hf-vton: High-fidelity virtual try-on via consistent geometric and semantic alignment.arXiv preprint arXiv:2505.19638, 2025

work page arXiv 2025
[32]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024
[33]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. PMLR, 2015

work page 2015
[34]

Score- based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score- based generative modeling through stochastic differential equations. InInternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=PxTIG12RRHS

work page 2021
[35]

Density estimation using Real NVP

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp.arXiv preprint arXiv:1605.08803, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[36]

NICE: Non-linear Independent Components Estimation

Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation.arXiv preprint arXiv:1410.8516, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[37]

Glow: Generative flow with invertible 1x1 convolutions.Advances in neural information processing systems, 31, 2018

Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions.Advances in neural information processing systems, 31, 2018. 12 RefVTON: person-to-person Try on with Additional Unpaired Visual ReferenceA PREPRINT

work page 2018
[38]

Neural ordinary differential equations

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Kristjanson Duvenaud. Neural ordinary differential equations. InNeural Information Processing Systems, 2018. URL https://api.semanticscholar.org/ CorpusID:49310446

work page 2018
[39]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.ArXiv, abs/2210.02747, 2022. URL https://api.semanticscholar.org/CorpusID:252734897

work page internal anchor Pith review Pith/arXiv arXiv 2022
[40]

Ladi- vton: Latent diffusion textual-inversion enhanced virtual try-on

Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Ladi- vton: Latent diffusion textual-inversion enhanced virtual try-on. InProceedings of the 31st ACM international conference on multimedia, pages 8580–8589, 2023

work page 2023
[41]

Diffusionclip: Text-guided diffusion models for robust image manipulation

Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2426–2435, 2022

work page 2022
[42]

Taming the power of diffusion models for high-quality virtual try-on with appearance flow

Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang. Taming the power of diffusion models for high-quality virtual try-on with appearance flow. InProceedings of the 31st ACM International Conference on Multimedia, pages 7599–7607, 2023

[43] Yutong Feng, Linlin Zhang, Hengyuan Cao, Yiming Chen, Xiaoduan Feng, Jian Cao, Yuxiong Wu, and Bin Wang. Omnitry: Virtual try-on anything without masks. arXiv preprint arXiv:2508.13632, 2025.

[44] Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Jiaming Liu, and Chuang Zhang. Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19085–19096, 2025.

[45] Riza Velioglu, Petra Bevandic, Robin Chan, and Barbara Hammer. Enhancing person-to-person virtual try-on with multi-garment virtual try-off. arXiv preprint arXiv:2504.13078, 2025.

[46] Nannan Zhang, Zhenyu Xie, Zhengwentai Sun, Hairui Zhu, Zirong Jin, Nan Xiang, Xiaoguang Han, and Song Wu. Viton-gun: Person-to-person virtual try-on via garment unwrapping. IEEE Transactions on Visualization and Computer Graphics, 31(10):7740–7751, 2025. doi:10.1109/TVCG.2025.3550776

[47] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.

[48] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[49] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[50] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

[51] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space ...

[52] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

[53] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

[54] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

[55] Jim Nilsson and Tomas Akenine-Möller. Understanding ssim. arXiv preprint arXiv:2006.13846, 2020.

[56] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

[57] Maximilian Seitzer. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0.

[58] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.

[59] Jianhao Zeng, Dan Song, Weizhi Nie, Hongshuo Tian, Tongtong Wang, and An-An Liu. Cat-dm: Controllable accelerated virtual try-on with diffusion model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8372–8382, 2024.

