RefTon: Reference person shot assist virtual Try-on
Pith reviewed 2026-05-18 01:36 UTC · model grok-4.3
The pith
RefTon generates virtual try-on results directly from source person and target garment images by adding unpaired reference photos for texture refinement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RefTon is a flux-based person-to-person virtual try-on framework that directly generates try-on results from a source image and a target garment without structural guidance or auxiliary components. It leverages additional unpaired reference images of the target garment worn on different individuals to refine texture alignment and maintain garment details, enabled by a newly built dataset of such references. Extensive experiments on public benchmarks demonstrate competitive or superior performance compared to state-of-the-art methods while preserving a simple and efficient design.
What carries the argument
Unpaired reference images of the target garment on different individuals, used to guide texture alignment and detail preservation inside a direct flux-based person-to-person generation process.
If this is right
- The virtual try-on pipeline simplifies by removing the need for body parsing, warped masks, or separate input-handling branches.
- Garment details are preserved more effectively through direct reference-based refinement during generation.
- Training can proceed with unpaired data thanks to the constructed reference dataset.
- Competitive benchmark performance holds even with the streamlined person-to-person structure.
Where Pith is reading between the lines
- The reference-image approach could extend to other conditional image generation tasks that benefit from cross-example consistency cues.
- Reduced architectural complexity may support faster inference or easier integration into mobile virtual try-on applications.
- Selecting or synthesizing the most useful reference images automatically could become a natural next step for robustness.
- The method may lower data requirements for similar garment-transfer problems by relying on unpaired rather than strictly paired examples.
Load-bearing premise
Unpaired reference images of the target garment on different individuals supply reliable and consistent guidance for texture alignment and detail preservation without introducing new artifacts.
What would settle it
Visible texture mismatches, lost garment details, or new artifacts appearing in try-on outputs on patterned clothing or extreme poses when reference images are included would show the guidance is not reliable.
Figures
read the original abstract
We introduce RefTon, a flux-based person-to-person virtual try-on framework that enhances garment realism through unpaired visual references. Unlike conventional approaches that rely on complex auxiliary inputs such as body parsing and warped mask or require finely designed extract branches to process various input conditions, RefTon streamlines the process by directly generating try-on results from a source image and a target garment, without the need for structural guidance or auxiliary components to handle diverse inputs. Moreover, inspired by human clothing selection behavior, RefTon leverages additional reference images (the target garment worn on different individuals) to provide powerful guidance for refining texture alignment and maintaining the garment details. To enable this capability, we built a dataset containing unpaired reference images for training. Extensive experiments on public benchmarks demonstrate that RefTon achieves competitive or superior performance compared to state-of-the-art methods, while maintaining a simple and efficient person-to-person design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RefTon, a Flux-based person-to-person virtual try-on framework that generates results directly from a source person image and target garment image. It avoids complex auxiliary inputs such as body parsing or warped masks and instead incorporates additional unpaired reference images (the target garment worn on different individuals) to refine texture alignment and garment details. A custom dataset of such references was constructed for training, and the abstract states that extensive experiments on public benchmarks show competitive or superior performance to state-of-the-art methods while preserving a simple design.
Significance. If the empirical claims hold and the reference integration proves robust, the work could meaningfully simplify virtual try-on pipelines by reducing dependence on structural guidance and auxiliary branches. The use of unpaired person-shot references to mimic human clothing selection offers a plausible route to better detail preservation, which would be valuable for practical e-commerce applications if the gains are reproducible and artifact-free across pose and lighting variations.
major comments (2)
- [Abstract] Abstract: the claim that RefTon 'achieves competitive or superior performance compared to state-of-the-art methods' is unsupported by any quantitative metrics, tables, ablation studies, or error analysis, which is load-bearing for the central empirical contribution.
- [Method] Method description (reference-image usage): the conditioning pathway for the unpaired reference images is unspecified (e.g., whether they are encoded in a separate branch, injected through cross-attention, concatenated at the input, or used only at inference), which is critical to verify that they supply consistent texture guidance rather than introducing artifacts when body shape, pose, or illumination differ from the target person.
minor comments (1)
- [Abstract] Abstract: the phrase 'person-to-person design' would benefit from a short clarification distinguishing it from standard garment-to-person try-on formulations.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and revised the paper to address the concerns regarding empirical support and methodological clarity. Below we provide point-by-point responses.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that RefTon 'achieves competitive or superior performance compared to state-of-the-art methods' is unsupported by any quantitative metrics, tables, ablation studies, or error analysis, which is load-bearing for the central empirical contribution.
Authors: We appreciate this observation. The manuscript describes extensive experiments on public benchmarks and includes qualitative comparisons, but we agree that the abstract claim would be more robust with explicit quantitative backing. In the revised manuscript we have added a dedicated results table reporting standard virtual try-on metrics (FID, LPIPS, SSIM) along with a user study, which supports the statement of competitive or superior performance. We have also updated the abstract to reference these quantitative findings. revision: yes
-
Referee: [Method] Method description (reference-image usage): the conditioning pathway for the unpaired reference images is unspecified (e.g., whether they are encoded in a separate branch, injected through cross-attention, concatenated at the input, or used only at inference), which is critical to verify that they supply consistent texture guidance rather than introducing artifacts when body shape, pose, or illumination differ from the target person.
Authors: We agree that the integration mechanism for the reference images requires explicit description to ensure reproducibility and to clarify robustness under varying conditions. The original submission outlined the overall person-to-person pipeline but did not detail the conditioning pathway. In the revised Section 3 we now specify that each unpaired reference image is passed through the Flux VAE encoder, after which its latent features are injected into the transformer blocks via dedicated cross-attention layers (separate from the garment and person conditioning). This allows the model to selectively attend to texture and detail cues while remaining robust to differences in body shape, pose, and lighting, as further supported by the added ablation studies. revision: yes
Circularity Check
No circularity: empirical method with no derivations or self-referential reductions
full rationale
The RefTon paper describes a Flux-based virtual try-on architecture that incorporates unpaired reference images for texture guidance and reports competitive performance via experiments on public benchmarks. No equations, first-principles derivations, or mathematical predictions appear in the provided text. Claims rest on architectural choices (direct person-to-person generation without body parsing or auxiliary branches) and a custom dataset, evaluated externally rather than fitted or defined circularly against the target results. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing manner that reduces the central contribution to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Flux-based diffusion models can directly synthesize realistic try-on images from source person and garment inputs when supplemented with unpaired references.
Forward citations
Cited by 3 Pith papers
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds
Aligning the DDIM forward diffusion process with flow-matching manifold evolution enables high-quality generation without time conditioning, and class-conditional synthesis is possible with an unconditional denoiser b...
Reference graph
Works this paper leans on
-
[1]
Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020
work page 2020
-
[2]
Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models.ArXiv, abs/2006.11239, 2020. URLhttps://api.semanticscholar.org/CorpusID:219955663
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[3]
Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer
Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021. URLhttps://api.semanticscholar.org/CorpusID:245335280
work page 2022
-
[4]
Viton-hd: High-resolution virtual try-on via misalignment-aware normalization
Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14131–14140, 2021
work page 2021
-
[5]
Viton: An image-based virtual try-on network
Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7543–7552, 2018
work page 2018
-
[6]
Toward characteristic- preserving image-based virtual try-on network
Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward characteristic- preserving image-based virtual try-on network. InProceedings of the European conference on computer vision (ECCV), pages 589–604, 2018
work page 2018
-
[7]
Wear-any-way: Manipulable virtual try-on via sparse correspondence alignment
Mengting Chen, Xi Chen, Zhonghua Zhai, Chen Ju, Xuewen Hong, Jinsong Lan, and Shuai Xiao. Wear-any-way: Manipulable virtual try-on via sparse correspondence alignment. InEuropean Conference on Computer Vision, pages 124–142. Springer, 2024
work page 2024
-
[8]
Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on
Yuhao Xu, Tao Gu, Weifeng Chen, and Chengcai Chen. Ootdiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on. InAAAI Conference on Artificial Intelligence, 2024. URL https://api. semanticscholar.org/CorpusID:268247604
work page 2024
-
[9]
Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on
Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8176–8185, 2024
work page 2024
-
[10]
Tryondiffusion: A tale of two unets
Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, and Ira Kemelmacher-Shlizerman. Tryondiffusion: A tale of two unets. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4606–4615, 2023
work page 2023
-
[11]
Magicanimate: Temporally consistent human image animation using diffusion model
Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024
work page 2024
-
[12]
Improving diffusion models for authentic virtual try-on in the wild
Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for authentic virtual try-on in the wild. InEuropean Conference on Computer Vision, pages 206–235. Springer, 2024
work page 2024
-
[13]
Jeongho Kim, Hoiyeong Jin, Sunghyun Park, and Jaegul Choo. Promptdresser: Improving the quality and control- lability of virtual try-on via generative textual prompt and prompt-aware mask.arXiv preprint arXiv:2412.16978, 2024
-
[14]
Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Dongmei Jiang, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models.arXiv preprint arXiv:2407.15886, 2024
-
[15]
Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, and Xiaodan Liang. Catv2ton: Taming diffusion transformers for vision-based virtual try-on with temporal concatenation.arXiv preprint arXiv:2501.11325, 2025
-
[16]
Omnivton: Training-free universal virtual try-on.arXiv preprint arXiv:2507.15037, 2025
Zhaotong Yang, Yuhui Li, Shengfeng He, Xinzhe Li, Yangyang Xu, Junyu Dong, and Yong Du. Omnivton: Training-free universal virtual try-on.arXiv preprint arXiv:2507.15037, 2025. 11 RefVTON: person-to-person Try on with Additional Unpaired Visual ReferenceA PREPRINT
-
[17]
Densepose: Dense human pose estimation in the wild
Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7297–7306, 2018
work page 2018
-
[18]
Deeppose: Human pose estimation via deep neural networks
Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1653–1660, 2014
work page 2014
-
[19]
Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y . A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019
work page 2019
-
[20]
Realtime multi-person 2d pose estimation using part affinity fields
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. InCVPR, 2017
work page 2017
-
[21]
Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. InCVPR, 2016
work page 2016
-
[22]
Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. Self-correction for human parsing.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. doi:10.1109/TPAMI.2020.3048039
-
[23]
Towards unified human parsing and pose estimation
Jian Dong, Qiang Chen, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. Towards unified human parsing and pose estimation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 843–850, 2014
work page 2014
-
[24]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023
work page 2023
-
[25]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.007...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Dress code: High-resolution multi-category virtual try-on
Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress code: High-resolution multi-category virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2231–2235, 2022
work page 2022
-
[27]
Vivid: Video virtual try-on using diffusion models,
Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, and Zheng-Jun Zha. Vivid: Video virtual try-on using diffusion models.arXiv preprint arXiv:2405.11794, 2024
-
[28]
Virtually trying on new clothing with arbitrary poses
Na Zheng, Xuemeng Song, Zhaozheng Chen, Linmei Hu, Da Cao, and Liqiang Nie. Virtually trying on new clothing with arbitrary poses. InProceedings of the 27th ACM international conference on multimedia, pages 266–274, 2019
work page 2019
-
[29]
Imagdressing- v1: Customizable virtual dressing
Fei Shen, Xin Jiang, Xin He, Hu Ye, Cong Wang, Xiaoyu Du, Zechao Li, and Jinghui Tang. Imagdressing- v1: Customizable virtual dressing. InAAAI Conference on Artificial Intelligence, 2024. URL https://api. semanticscholar.org/CorpusID:271244829
work page 2024
-
[30]
Zijian He, Peixin Chen, Guangrun Wang, Guanbin Li, Philip HS Torr, and Liang Lin. Wildvidfit: Video virtual try-on in the wild via image-based controlled diffusion models.arXiv preprint arXiv:2407.10625, 2024
-
[31]
Ming Meng, Qi Dong, Jiajie Li, Zhe Zhu, Xingyu Wang, Zhaoxin Fan, Wei Zhao, and Wenjun Wu. Hf-vton: High-fidelity virtual try-on via consistent geometric and semantic alignment.arXiv preprint arXiv:2505.19638, 2025
-
[32]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
work page 2024
-
[33]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. PMLR, 2015
work page 2015
-
[34]
Score- based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score- based generative modeling through stochastic differential equations. InInternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=PxTIG12RRHS
work page 2021
-
[35]
Density estimation using Real NVP
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp.arXiv preprint arXiv:1605.08803, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[36]
NICE: Non-linear Independent Components Estimation
Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation.arXiv preprint arXiv:1410.8516, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[37]
Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions.Advances in neural information processing systems, 31, 2018. 12 RefVTON: person-to-person Try on with Additional Unpaired Visual ReferenceA PREPRINT
work page 2018
-
[38]
Neural ordinary differential equations
Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Kristjanson Duvenaud. Neural ordinary differential equations. InNeural Information Processing Systems, 2018. URL https://api.semanticscholar.org/ CorpusID:49310446
work page 2018
-
[39]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.ArXiv, abs/2210.02747, 2022. URL https://api.semanticscholar.org/CorpusID:252734897
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[40]
Ladi- vton: Latent diffusion textual-inversion enhanced virtual try-on
Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Ladi- vton: Latent diffusion textual-inversion enhanced virtual try-on. InProceedings of the 31st ACM international conference on multimedia, pages 8580–8589, 2023
work page 2023
-
[41]
Diffusionclip: Text-guided diffusion models for robust image manipulation
Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2426–2435, 2022
work page 2022
-
[42]
Taming the power of diffusion models for high-quality virtual try-on with appearance flow
Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang. Taming the power of diffusion models for high-quality virtual try-on with appearance flow. InProceedings of the 31st ACM International Conference on Multimedia, pages 7599–7607, 2023
work page 2023
-
[43]
Omnitry: Virtual try-on anything without masks.arXiv preprint arXiv:2508.13632, 2025
Yutong Feng, Linlin Zhang, Hengyuan Cao, Yiming Chen, Xiaoduan Feng, Jian Cao, Yuxiong Wu, and Bin Wang. Omnitry: Virtual try-on anything without masks.arXiv preprint arXiv:2508.13632, 2025
-
[44]
Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks
Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Jiaming Liu, and Chuang Zhang. Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19085–19096, 2025
work page 2025
-
[45]
Riza Velioglu, Petra Bevandic, Robin Chan, and Barbara Hammer. Enhancing person-to-person virtual try-on with multi-garment virtual try-off.arXiv preprint arXiv:2504.13078, 2025
-
[46]
Nannan Zhang, Zhenyu Xie, Zhengwentai Sun, Hairui Zhu, Zirong Jin, Nan Xiang, Xiaoguang Han, and Song Wu. Viton-gun: Person-to-person virtual try-on via garment unwrapping.IEEE Transactions on Visualization and Computer Graphics, 31(10):7740–7751, 2025. doi:10.1109/TVCG.2025.3550776
-
[47]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
work page 2023
-
[48]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[49]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[50]
Flux.https://github.com/black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024
work page 2024
-
[51]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[53]
Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
work page 2022
-
[54]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
Jim Nilsson and Tomas Akenine-Möller. Understanding ssim.arXiv preprint arXiv:2006.13846, 2020
-
[56]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018
work page 2018
-
[57]
pytorch-fid: FID Score for PyTorch
Maximilian Seitzer. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0
work page 2020
-
[58]
Mikołaj Bi´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans.arXiv preprint arXiv:1801.01401, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[59]
keep the {target cloth} cloth unchanged
Jianhao Zeng, Dan Song, Weizhi Nie, Hongshuo Tian, Tongtong Wang, and An-An Liu. Cat-dm: Controllable accelerated virtual try-on with diffusion model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8372–8382, 2024. 13 RefVTON: person-to-person Try on with Additional Unpaired Visual ReferenceA PREPRINT A Appendix...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.