FEAT: Fashion Editing and Try-On from Any Design
Pith reviewed 2026-05-08 19:04 UTC · model grok-4.3
The pith
FEAT enables editing and virtual try-on of garments and accessories from any design source including artwork and photographs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present FEAT (Fashion Editing And Try-On from Any Design), a method that enables editing and try-on across garments and accessories using diverse design sources. To achieve this, we introduce Disentangled Dual Injection (DDI). It takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Furthermore, we propose Orthogonal-Guided Noise Fusion (OGNF), a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. Extensive experiments demonstrate that FEAT achieves state-of-the-art performance in design flexibility, prompt consistency, and visual realism.
What carries the argument
Two components carry the argument. Disentangled Dual Injection (DDI) separates content and style cues from apparel or non-apparel sources and injects them selectively. Orthogonal-Guided Noise Fusion (OGNF) projects out residual garment signals and applies region-specific noise so that garments and accessories are replaced cleanly.
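To make the "projects out residual signals" idea concrete, here is a minimal sketch of a generic orthogonal projection in plain Python. It is not the paper's implementation: the vectors `z` (a stand-in for a latent or noise vector) and `g` (a stand-in for a residual-garment direction) are hypothetical, and how FEAT actually obtains such a direction is not stated in the abstract.

```python
def dot(u, v):
    """Inner product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def project_out(z, g, eps=1e-12):
    """Return z with its component along g removed.

    Computes z - (<z, g> / <g, g>) * g, so the result is orthogonal to g.
    If g is (numerically) the zero vector, z is returned unchanged.
    """
    gg = dot(g, g)
    if gg < eps:
        return list(z)
    scale = dot(z, g) / gg
    return [zi - scale * gi for zi, gi in zip(z, g)]


# Hypothetical toy values: z stands in for a latent, g for a residual direction.
z = [3.0, 4.0, 0.0]
g = [1.0, 0.0, 0.0]
cleaned = project_out(z, g)
print(cleaned)          # component along g has been removed
print(dot(cleaned, g))  # inner product with g after projection
```

After the projection, `cleaned` has no component along `g`, which is the property the "orthogonal" in OGNF's name suggests the method relies on.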
If this is right
- The method supports complete outfits that include both garments and accessories from any design source.
- It achieves state-of-the-art results in design flexibility, prompt consistency, and visual realism.
- Orthogonal-Guided Noise Fusion operates without additional training to remove residuals and fuse new elements.
- Design inputs can come from creative sources such as artwork, abstract imagery, and natural photographs.
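The third point above mentions fusing new elements without training. One common way to apply different noise to different regions is a mask-weighted blend, sketched below in plain Python. This is an assumption about what a "region-specific noise strategy" could look like, not the procedure from the paper; the names `base`, `fresh`, and `mask` are hypothetical.

```python
import random


def fuse_noise(base, fresh, mask):
    """Blend two noise vectors element by element.

    Where mask is 0.0 the base noise is kept; where mask is 1.0 the fresh
    noise is used; values in between interpolate linearly.
    """
    if not (len(base) == len(fresh) == len(mask)):
        raise ValueError("base, fresh, and mask must have the same length")
    return [m * f + (1.0 - m) * b for b, f, m in zip(base, fresh, mask)]


rng = random.Random(0)                       # fixed seed for repeatability
base = [rng.gauss(0.0, 1.0) for _ in range(6)]   # stand-in: noise for preserved regions
fresh = [rng.gauss(0.0, 1.0) for _ in range(6)]  # stand-in: noise for the edited region
mask = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]            # 1.0 marks the region to replace
fused = fuse_noise(base, fresh, mask)
```

In this sketch the masked positions take the fresh noise and all others keep the base noise, so only the region marked for replacement is resampled.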
Where Pith is reading between the lines
- Users could upload personal sketches or photos as direct inspiration without first locating matching garment reference images.
- The same disentanglement-plus-projection pattern might apply to other structured image synthesis tasks where inputs and outputs share little visual overlap.
- Real-time interfaces could let non-experts iterate on fashion concepts by swapping any visual prompt and instantly viewing body results.
Load-bearing premise
Disentangled Dual Injection and Orthogonal-Guided Noise Fusion can reliably pull usable design cues from non-apparel images and map them onto human bodies to produce coherent, artifact-free results.
What would settle it
If feeding abstract artwork or landscape photos as design inputs produces outputs with mismatched textures, distorted fits, incomplete accessories, or visible artifacts on the human model, the central claim would fail.
Original abstract
Fashion design aims to express a designer's creative intent and to depict how garments interact with the human body. Recent methods condition on multimodal inputs to support garment editing and virtual try-on. However, existing methods still (i) confine design to garment-related images, excluding creative design sources such as artwork, abstract imagery, and natural photographs, and (ii) cannot support complete outfits, including accessories. We present FEAT (Fashion Editing And Try-On from Any Design), a method that enables editing and try-on across garments and accessories using diverse design sources. To achieve this, we introduce Disentangled Dual Injection (DDI). It takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Furthermore, we propose Orthogonal-Guided Noise Fusion (OGNF), a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. Extensive experiments demonstrate that FEAT achieves state-of-the-art performance in design flexibility, prompt consistency, and visual realism.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FEAT (Fashion Editing And Try-On from Any Design), a method for editing and try-on across garments and accessories that accepts diverse design sources, including non-apparel ones such as artwork, abstract imagery, and natural photographs. Its first component, Disentangled Dual Injection (DDI), takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Its second component, Orthogonal-Guided Noise Fusion (OGNF), is a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. The paper asserts that extensive experiments demonstrate state-of-the-art performance in design flexibility, prompt consistency, and visual realism.
Significance. If the results hold, this work would be significant for the computer vision and graphics community working on virtual try-on and image synthesis. By allowing design inputs from arbitrary sources, it addresses a key limitation of existing methods, which are restricted to garment-related images. A practical strength is that OGNF is training-free, so it adds no model training cost. This could lead to more flexible and creative applications in fashion design and e-commerce. However, the significance is contingent on rigorous empirical validation, which the abstract references but does not detail.
major comments (1)
- [Abstract] The abstract asserts state-of-the-art results in flexibility, prompt consistency, and realism but supplies no metrics, baselines, ablation studies, or experimental details to support the claims. This is load-bearing for the central claim that DDI and OGNF successfully disentangle and transfer design cues from non-apparel sources to produce coherent, artifact-free outputs.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the opportunity to clarify our work. We address the single major comment below and outline a targeted revision to the abstract.
- Referee: [Abstract] The abstract asserts state-of-the-art results in flexibility, prompt consistency, and realism but supplies no metrics, baselines, ablation studies, or experimental details to support the claims. This is load-bearing for the central claim that DDI and OGNF successfully disentangle and transfer design cues from non-apparel sources to produce coherent, artifact-free outputs.
  Authors: We agree that the abstract is concise by design and does not embed the quantitative details. The full manuscript (Sections 4 and 5) supplies the requested support: quantitative comparisons against multiple baselines using standard metrics for realism (FID, LPIPS), prompt consistency (CLIP similarity), and design flexibility (user studies and region-specific success rates), together with ablations isolating the contributions of DDI and OGNF. These experiments directly validate the disentanglement and artifact-free transfer claims for both apparel and non-apparel inputs. To address the referee's concern, we will revise the abstract to include one or two concise quantitative highlights drawn from the experimental results while respecting length limits.
  Revision: yes
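For readers unfamiliar with the prompt-consistency metric named in the rebuttal, CLIP similarity is usually reported as the cosine similarity between an image embedding and a text embedding. The sketch below shows only that final cosine step in plain Python. The embeddings are made-up toy vectors, no CLIP model is loaded, and the paper's exact evaluation protocol is not given on this page.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Hypothetical toy embeddings; real CLIP embeddings have hundreds of dimensions.
image_embedding = [0.2, 0.9, 0.1]
text_embedding = [0.25, 0.8, 0.0]
score = cosine_similarity(image_embedding, text_embedding)
print(round(score, 3))
```

A higher score means the generated image and the prompt point in more similar directions in the embedding space.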
Circularity Check
No significant circularity; method claims are architectural, not derived from self-referential fits or citations
The paper introduces FEAT as a new conditional generation architecture for fashion editing and try-on, defining Disentangled Dual Injection (DDI) and Orthogonal-Guided Noise Fusion (OGNF) as novel mechanisms that operate on multimodal design inputs. No equations, loss functions, or performance predictions are presented that reduce by construction to fitted parameters, self-citations, or renamed prior results. The central claims rest on the empirical behavior of the proposed components rather than any load-bearing derivation chain that collapses to the inputs. This is a standard descriptive CV method paper with no detectable circularity patterns.