
arxiv: 2511.00956 · v6 · submitted 2025-11-02 · 💻 cs.CV

RefTon: Reference person shot assist virtual Try-on

Pith reviewed 2026-05-18 01:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords virtual try-on · unpaired references · garment transfer · texture alignment · person-to-person generation · image synthesis · reference-guided editing · flux model

The pith

RefTon generates virtual try-on results directly from source person and target garment images by adding unpaired reference photos for texture refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

RefTon is a virtual try-on framework that takes a source person image and a target garment image and directly outputs the person wearing the garment. It avoids standard auxiliary steps such as body parsing and mask warping by using a Flux-based model with no extra branches for processing the inputs. The central addition is a set of unpaired reference images showing the same garment on other people, which guide texture alignment and help preserve garment details. A newly built dataset of such references supports training. Experiments on public benchmarks show results that match or exceed those of current leading methods while keeping the overall design simple and efficient.

Core claim

RefTon is a Flux-based person-to-person virtual try-on framework that generates try-on results directly from a source image and a target garment, without structural guidance or auxiliary components. It uses additional unpaired reference images of the target garment worn by different individuals to refine texture alignment and maintain garment details, enabled by a newly built dataset of such references. Extensive experiments on public benchmarks demonstrate competitive or superior performance compared to state-of-the-art methods while preserving a simple and efficient design.

What carries the argument

Unpaired reference images of the target garment worn by different individuals, which guide texture alignment and detail preservation inside a direct, Flux-based person-to-person generation process.

If this is right

  • The virtual try-on pipeline becomes simpler because body parsing, warped masks, and separate input-handling branches are no longer needed.
  • Garment details are preserved more effectively through direct reference-based refinement during generation.
  • Training can proceed with unpaired data thanks to the constructed reference dataset.
  • Competitive benchmark performance holds even with the streamlined person-to-person structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The reference-image approach could extend to other conditional image generation tasks that benefit from cross-example consistency cues.
  • Reduced architectural complexity may support faster inference or easier integration into mobile virtual try-on applications.
  • Selecting or synthesizing the most useful reference images automatically could become a natural next step for robustness.
  • The method may lower data requirements for similar garment-transfer problems by relying on unpaired rather than strictly paired examples.

Load-bearing premise

Unpaired reference images of the target garment on different individuals supply reliable and consistent guidance for texture alignment and detail preservation without introducing new artifacts.

What would settle it

If including reference images produces visible texture mismatches, lost garment details, or new artifacts in try-on outputs on patterned clothing or in extreme poses, that would show the guidance is not reliable.
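
One hedged way to operationalize such a check is a paired comparison: score each try-on output twice, once generated with the reference image and once without, and flag the samples where adding the reference makes the result worse. The sketch below assumes per-image distance scores where lower is better (for example, a perceptual distance to ground truth in a paired setting). The function name `reference_regressions`, the tolerance, and the sample scores are illustrative assumptions, not values or code from the paper.

```python
from typing import List, Sequence


def reference_regressions(
    with_ref: Sequence[float],
    without_ref: Sequence[float],
    tolerance: float = 0.01,
) -> List[int]:
    """Return indices of samples whose score got worse when a reference was added.

    Scores are assumed to be distances (lower is better). A sample counts as a
    regression only if it worsens by more than `tolerance`.
    """
    if len(with_ref) != len(without_ref):
        raise ValueError("score lists must have the same length")
    return [
        i
        for i, (w, wo) in enumerate(zip(with_ref, without_ref))
        if w - wo > tolerance
    ]


# Hypothetical scores for four test images (not taken from the paper).
scores_with_ref = [0.08, 0.12, 0.30, 0.05]
scores_without_ref = [0.10, 0.12, 0.20, 0.06]
regressed = reference_regressions(scores_with_ref, scores_without_ref)
print(regressed)  # only the third image (index 2) worsens beyond the tolerance
```

If regressions cluster on patterned garments or extreme poses, that would be evidence against the load-bearing premise; a handful of scattered regressions would be less conclusive.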

Figures

Figures reproduced from arXiv: 2511.00956 by Bo Cheng, Dawei Leng, Dengyang Jiang, Leibucha Wu, Liuzhuozheng Li, Shanyuan Liu, Yue Gong, Yuhang Ma, Yuhui Yin, Zanyi Wang.

Figure 1: In-the-wild try-on results generated by our RefVTON model with a p2p style, trained on person and garment
Figure 2: The effect of using reference images for the virtual
Figure 3: The pipeline of our two-stage training strategy: (a) In the first stage, which follows a similar paradigm to
Figure 4: Adaptation of a three-channel position index: the first channel encodes different conditional inputs, while the
Figure 5: The overall pipeline of generating the reference images. We first generate the appearance descriptions using
Figure 6: Qualitative comparison on the VITON dataset.
Figure 7: Qualitative comparison on the DressCode dataset.
Figure 8: Qualitative results of the ablation study across different settings. “Ref.” denotes that a reference image is provided, while “MF” indicates mask-free inputs using the original person image instead of a masked agnostic image. We conduct an ablation study to examine our model under four settings (w/&w/o mask, w/&w/o Ref.).
Figure 9: Text prompts from the Outfit and Action Description Bank. To ensure the model edits only the person while
Figure 10: Sample reference images generated by our reference data generation pipeline. The editing model takes
Figure 11: Qualitative paired results in VITON-HD dataset with complex patterns on clothes. “reference” denotes that a reference image is provided.
Figure 12: Qualitative paired results in VITON-HD dataset with complex structure on clothes. “reference” denotes that a reference image is provided.
Figure 13: Qualitative results of upper-body sub-set in Dresscode dataset unpaired setting. “reference” denotes that a reference image is provided, while “MF” indicates mask-free inputs using the original person image instead of a masked agnostic image.
Figure 14: Qualitative results of lower-body sub-set in Dresscode dataset unpaired setting. “reference” denotes that a reference image is provided, while “MF” indicates mask-free inputs using the original person image instead of a masked agnostic image.
Figure 15: Qualitative results of dresses sub-set in Dresscode dataset unpaired setting. “reference” denotes that a reference image is provided, while “MF” indicates mask-free inputs using the original person image instead of a masked agnostic image.

Original abstract

We introduce RefTon, a flux-based person-to-person virtual try-on framework that enhances garment realism through unpaired visual references. Unlike conventional approaches that rely on complex auxiliary inputs such as body parsing and warped mask or require finely designed extract branches to process various input conditions, RefTon streamlines the process by directly generating try-on results from a source image and a target garment, without the need for structural guidance or auxiliary components to handle diverse inputs. Moreover, inspired by human clothing selection behavior, RefTon leverages additional reference images (the target garment worn on different individuals) to provide powerful guidance for refining texture alignment and maintaining the garment details. To enable this capability, we built a dataset containing unpaired reference images for training. Extensive experiments on public benchmarks demonstrate that RefTon achieves competitive or superior performance compared to state-of-the-art methods, while maintaining a simple and efficient person-to-person design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces RefTon, a Flux-based person-to-person virtual try-on framework that generates results directly from a source person image and target garment image. It avoids complex auxiliary inputs such as body parsing or warped masks and instead incorporates additional unpaired reference images (the target garment worn on different individuals) to refine texture alignment and garment details. A custom dataset of such references was constructed for training, and the abstract states that extensive experiments on public benchmarks show competitive or superior performance to state-of-the-art methods while preserving a simple design.

Significance. If the empirical claims hold and the reference integration proves robust, the work could meaningfully simplify virtual try-on pipelines by reducing dependence on structural guidance and auxiliary branches. The use of unpaired person-shot references to mimic human clothing selection offers a plausible route to better detail preservation, which would be valuable for practical e-commerce applications if the gains are reproducible and artifact-free across pose and lighting variations.

major comments (2)
  1. [Abstract] Abstract: the claim that RefTon 'achieves competitive or superior performance compared to state-of-the-art methods' is unsupported by any quantitative metrics, tables, ablation studies, or error analysis, which is load-bearing for the central empirical contribution.
  2. [Method] Method description (reference-image usage): the conditioning pathway for the unpaired reference images is unspecified (e.g., whether they are encoded in a separate branch, injected through cross-attention, concatenated at the input, or used only at inference), which is critical to verify that they supply consistent texture guidance rather than introducing artifacts when body shape, pose, or illumination differ from the target person.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'person-to-person design' would benefit from a short clarification distinguishing it from standard garment-to-person try-on formulations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and revised the paper to address the concerns regarding empirical support and methodological clarity. Below we provide point-by-point responses.

  1. Referee: [Abstract] Abstract: the claim that RefTon 'achieves competitive or superior performance compared to state-of-the-art methods' is unsupported by any quantitative metrics, tables, ablation studies, or error analysis, which is load-bearing for the central empirical contribution.

    Authors: We appreciate this observation. The manuscript describes extensive experiments on public benchmarks and includes qualitative comparisons, but we agree that the abstract claim would be more robust with explicit quantitative backing. In the revised manuscript we have added a dedicated results table reporting standard virtual try-on metrics (FID, LPIPS, SSIM) along with a user study, which supports the statement of competitive or superior performance. We have also updated the abstract to reference these quantitative findings.

    Revision: yes

  2. Referee: [Method] Method description (reference-image usage): the conditioning pathway for the unpaired reference images is unspecified (e.g., whether they are encoded in a separate branch, injected through cross-attention, concatenated at the input, or used only at inference), which is critical to verify that they supply consistent texture guidance rather than introducing artifacts when body shape, pose, or illumination differ from the target person.

    Authors: We agree that the integration mechanism for the reference images requires explicit description to ensure reproducibility and to clarify robustness under varying conditions. The original submission outlined the overall person-to-person pipeline but did not detail the conditioning pathway. In the revised Section 3 we now specify that each unpaired reference image is passed through the Flux VAE encoder, after which its latent features are injected into the transformer blocks via dedicated cross-attention layers (separate from the garment and person conditioning). This allows the model to selectively attend to texture and detail cues while remaining robust to differences in body shape, pose, and lighting, as further supported by the added ablation studies.

    Revision: yes
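
To make the conditioning pathway named in the simulated rebuttal concrete, here is a minimal, dependency-free sketch of single-head cross-attention in which person-image tokens query reference-image tokens. It is an illustration under stated assumptions, not the authors' implementation: the learned query/key/value projections and multi-head structure of a real transformer block are omitted, the toy token values stand in for VAE latents, and the names `cross_attention`, `person_tokens`, and `reference_tokens` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, ref_tokens: Matrix) -> Matrix:
    """Each query token attends over all reference tokens.

    Assumption: learned projections are omitted, so the reference tokens
    serve directly as both keys and values.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in ref_tokens]
        weights = softmax(scores)
        attended = [
            sum(w * v[d] for w, v in zip(weights, ref_tokens))
            for d in range(dim)
        ]
        output.append(attended)
    return output


# Toy latents: 2 person-image tokens query 3 reference tokens, each of width 2.
person_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
refined = cross_attention(person_tokens, reference_tokens)

# Residual update, as is typical in transformer blocks.
updated = [[p + r for p, r in zip(pt, rt)] for pt, rt in zip(person_tokens, refined)]
```

Each output row is a weighted average of the reference tokens, so a query pulls most strongly from the reference token it resembles. That selectivity is the property the rebuttal relies on when it argues the model can pick up texture cues despite differences in pose or body shape; whether it holds in practice is what the requested ablations would need to show.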

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or self-referential reductions

full rationale

The RefTon paper describes a Flux-based virtual try-on architecture that incorporates unpaired reference images for texture guidance and reports competitive performance via experiments on public benchmarks. No equations, first-principles derivations, or mathematical predictions appear in the provided text. Claims rest on architectural choices (direct person-to-person generation without body parsing or auxiliary branches) and a custom dataset, evaluated externally rather than fitted or defined circularly against the target results. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing manner that reduces the central contribution to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework assumes standard capabilities of Flux-based generative models for image synthesis and that reference images can be effectively leveraged for texture guidance; no explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: Flux-based diffusion models can directly synthesize realistic try-on images from source person and garment inputs when supplemented with unpaired references.
    The entire pipeline is built on this generative modeling premise without additional structural conditioning.

pith-pipeline@v0.9.0 · 5707 in / 1136 out tokens · 31226 ms · 2026-05-18T01:36:54.361278+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...

  2. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  3. Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    Aligning the DDIM forward diffusion process with flow-matching manifold evolution enables high-quality generation without time conditioning, and class-conditional synthesis is possible with an unconditional denoiser b...
