The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
Pith reviewed 2026-05-16 20:03 UTC · model grok-4.3
The pith
Keyframe-driven details injection into standard DiT blocks improves garment fidelity and background integrity in video virtual try-on.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KeyTailor uses keyframe sampling to identify frames rich in foreground dynamics and background consistency, then routes garment information through a details enhancement module and background information through a collaborative optimization module. The resulting enriched latents are combined with standard conditioning inputs and fed into unmodified DiT blocks, producing try-on videos that maintain realistic garment movement and scene coherence across frames without architectural changes or extra interaction modules.
What carries the argument
Keyframe-driven details injection strategy, which samples keyframes and employs garment details enhancement and collaborative background optimization modules to distill information for direct injection into standard DiT blocks.
Load-bearing premise
Keyframes inherently contain both foreground dynamics and background consistency in sufficient quality to allow the two modules to distill useful details for injection.
What would settle it
An ablation that disables the keyframe sampling and detail-injection modules entirely and measures whether garment fidelity and background integrity scores on the same test videos drop to or below those of the strongest baseline DiT method.
Figures
read the original abstract
Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets also restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently,two tailored keyframe-driven modules, the garment details enhancement module and the collaborative background optimization module, are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by keyframes.These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, while simultaneously avoiding additional complexity. In addition, our dataset ViT-HD comprises 15, 070 high-quality video samples at a resolution of 810*1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes KeyTailor, a framework for video virtual try-on that uses an instruction-guided keyframe sampling strategy to select informative frames, followed by a garment details enhancement module and a collaborative background optimization module to distill and inject enriched latents into unmodified DiT blocks along with pose, mask, and noise. It also introduces the ViT-HD dataset of 15,070 high-resolution videos. The central claim is that this keyframe-driven approach achieves superior garment fidelity and background integrity compared to state-of-the-art baselines in both dynamic and static scenarios while avoiding additional DiT complexity.
Significance. If the empirical claims hold, the work offers a practical efficiency gain for DiT-based VVT by leveraging inherent keyframe properties rather than architectural modifications, and the release of ViT-HD addresses a clear data-scale limitation in the field. The design credits the avoidance of extra interaction modules and the focus on detail injection as strengths for reproducibility and generalization.
major comments (3)
- [§3.2] §3.2 (Keyframe-driven modules): The core assumption that keyframes inherently supply both foreground garment dynamics and background consistency for effective distillation is load-bearing for the performance claim, yet the manuscript provides no targeted validation (e.g., ablation on motion speed or occlusion cases) showing that the distilled latents actually capture critical folds or rapid changes; without this, gains in fidelity do not necessarily follow from the architecture.
- [§4] §4 (Experiments): The outperformance claim over baselines is central but the reported results lack quantitative metrics (e.g., FID, LPIPS, or garment-specific scores), ablation tables, error analysis, or statistical tests; the abstract's qualitative statement alone is insufficient to verify the central performance assertion.
- [§3.3] §3.3 (Injection mechanism): The description of how garment-related and background-optimized latents are combined with pose/mask/noise inputs and injected into standard DiT blocks is high-level; precise equations or a diagram showing the latent fusion operation are needed to confirm that no implicit modifications to the DiT occur and to support reproducibility.
minor comments (3)
- [Abstract] Abstract: Typo in 'Subsequently,two' (missing space after comma).
- [Abstract] Abstract: Dataset size written as '15, 070' should be standardized to '15,070'.
- [§4.1] §4.1: Figure captions for dynamic vs. static examples could explicitly note the keyframe selection criteria used in each case.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and commit to revising the manuscript accordingly to strengthen the presentation and validation of our claims.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Keyframe-driven modules): The core assumption that keyframes inherently supply both foreground garment dynamics and background consistency for effective distillation is load-bearing for the performance claim, yet the manuscript provides no targeted validation (e.g., ablation on motion speed or occlusion cases) showing that the distilled latents actually capture critical folds or rapid changes; without this, gains in fidelity do not necessarily follow from the architecture.
Authors: We agree that targeted validation of the keyframe assumption is important to support the performance claims. The manuscript motivates the approach by noting that keyframes inherently contain foreground dynamics and background consistency, but we acknowledge the lack of specific ablations on motion speed or occlusion. In the revision, we will add experiments including ablations on these cases and analysis showing how the distilled latents capture critical folds and rapid changes. revision: yes
-
Referee: [§4] §4 (Experiments): The outperformance claim over baselines is central but the reported results lack quantitative metrics (e.g., FID, LPIPS, or garment-specific scores), ablation tables, error analysis, or statistical tests; the abstract's qualitative statement alone is insufficient to verify the central performance assertion.
Authors: We acknowledge that the experimental results would be more convincing with additional quantitative support. While the manuscript includes qualitative comparisons demonstrating superiority in garment fidelity and background integrity, we will expand the experiments section in the revision to include quantitative metrics such as FID, LPIPS, garment-specific scores, full ablation tables, error analysis, and statistical tests. revision: yes
-
Referee: [§3.3] §3.3 (Injection mechanism): The description of how garment-related and background-optimized latents are combined with pose/mask/noise inputs and injected into standard DiT blocks is high-level; precise equations or a diagram showing the latent fusion operation are needed to confirm that no implicit modifications to the DiT occur and to support reproducibility.
Authors: We agree that the injection mechanism description can be made more precise to aid reproducibility. The manuscript emphasizes that the design injects enriched latents into unmodified DiT blocks without additional complexity, but we will revise §3.3 to include precise equations for the latent fusion operation and a diagram illustrating the combination with pose, mask, and noise inputs. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper introduces a new framework (KeyTailor) and dataset (ViT-HD) with keyframe-driven modules for details injection into unmodified DiT blocks. No equations, derivations, or fitted parameters are described that reduce any prediction or result to its own inputs by construction. The motivating assumption about keyframes is presented as an empirical observation rather than a self-referential definition or self-citation chain. The central claims rest on architectural novelty and experimental validation, remaining self-contained without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Keyframes inherently contain both foreground dynamics and background consistency
Reference graph
Works this paper leans on
-
[1]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Day- iheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfe...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 2, 12
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields.IEEE transactions on pattern analysis and machine intelligence, 43(1):172–186,
-
[4]
Quo vadis, action recognition? a new model and the kinetics dataset
Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Inpro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. 13
work page 2017
-
[5]
Crossvit: Cross-attention multi-scale vision transformer for image classification
Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. InProceedings of the IEEE/CVF in- ternational conference on computer vision, pages 357–366,
-
[6]
Viton-hd: High-resolution virtual try-on via misalignment-aware normalization
Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14131–14140, 2021. 6, 7, 12
work page 2021
-
[7]
Improving diffusion models for au- thentic virtual try-on in the wild
Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for au- thentic virtual try-on in the wild. InEuropean Conference on Computer Vision, pages 206–235. Springer, 2024. 7
work page 2024
-
[8]
Catvton: Concatenation is all you need for virtual try-on with diffusion models, 2025
Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models, 2025. 7
work page 2025
-
[9]
Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, and Xiaodan Liang. Catv2ton: Taming diffusion transformers for vision-based virtual try-on with temporal concatenation. arXiv preprint arXiv:2501.11325, 2025. 2, 4, 7, 8, 12
-
[10]
Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining
Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining.arXiv preprint arXiv:2304.09151,
-
[11]
Fw-gan: Flow-navigated warping gan for video virtual try-on
Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bowen Wu, Bing-Cheng Chen, and Jian Yin. Fw-gan: Flow-navigated warping gan for video virtual try-on. InProceedings of the IEEE/CVF international conference on computer vision, pages 1161–1170, 2019. 2, 3, 6, 7, 12
work page 2019
-
[12]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning,
-
[13]
Vivid: Video virtual try-on using diffusion models,
Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, and Zheng- Jun Zha. Vivid: Video virtual try-on using diffusion models. arXiv preprint arXiv:2405.11794, 2024. 2, 3, 6, 7, 8, 12
-
[14]
Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and im- agenet? InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018. 13
work page 2018
-
[15]
Wildvidfit: Video virtual try- on in the wild via image-based controlled diffusion models
Zijian He, Peixin Chen, Guangrun Wang, Guanbin Li, Philip HS Torr, and Liang Lin. Wildvidfit: Video virtual try- on in the wild via image-based controlled diffusion models. InEuropean Conference on Computer Vision, pages 123–
-
[16]
Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 4, 12
work page 2022
-
[17]
Fitdit: Advancing the authentic garment details for high-fidelity virtual try-on
Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Chengming Xu, Jinlong Peng, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, and Yanwei Fu. Fitdit: Advancing the authentic garment details for high-fidelity virtual try-on. arXiv preprint arXiv:2411.10499, 2024. 6
-
[18]
Cloth- former: Taming video virtual try-on in all module
Jianbin Jiang, Tan Wang, He Yan, and Junhui Liu. Cloth- former: Taming video virtual try-on in all module. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10799–10808, 2022. 7, 12
work page 2022
-
[19]
Stableviton: Learning semantic correspon- dence with latent diffusion model for virtual try-on
Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learning semantic correspon- dence with latent diffusion model for virtual try-on. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8176–8185, 2024. 7
work page 2024
-
[20]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 2, 12
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Flux.https://github.com/ black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 12
work page 2024
-
[22]
Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, and Peng- Tao Jiang. Magictryon: Harnessing diffusion transformer 9 for garment-preserving video virtual try-on.arXiv preprint arXiv:2505.21325, 2025. 2, 4, 5, 7, 8, 12, 13
-
[23]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. InPro- ceedings of the 40th International Conference on Machine Learning. JMLR.org, 2023. 4
work page 2023
-
[24]
Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. Self- correction for human parsing.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):3260–3271, 2020. 3, 4, 13
work page 2020
-
[25]
Siqi Li, Zhengkai Jiang, Jiawei Zhou, Zhihong Liu, Xiaowei Chi, and Haoqian Wang. Realvvt: Towards photorealistic video virtual try-on via spatio-temporal consistency.CoRR, abs/2501.08682, 2025. 2
-
[26]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 12
work page 2023
-
[27]
Dress code: High- resolution multi-category virtual try-on
Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress code: High- resolution multi-category virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2231–2235, 2022. 6, 7
work page 2022
-
[28]
Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on
Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Mar- cella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. In Proceedings of the 31st ACM international conference on multimedia, pages 8580–8589, 2023. 7
work page 2023
-
[29]
Swifttry: Fast and consistent video virtual try- on with diffusion models
Hung Nguyen, Quang Qui-Vinh Nguyen, Khoi Nguyen, and Rang Nguyen. Swifttry: Fast and consistent video virtual try- on with diffusion models. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 6200–6208, 2025. 2
work page 2025
-
[30]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,
-
[31]
Movie Gen: A Cast of Media Foundation Models
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720,
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 8, 12
work page 2022
-
[33]
U- net: Convolutional networks for biomedical image segmen- tation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InInternational Conference on Medical image com- puting and computer-assisted intervention, pages 234–241. Springer, 2015. 2, 12
work page 2015
-
[34]
Learning spatiotemporal features with 3d convolutional networks
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. InProceedings of the IEEE inter- national conference on computer vision, pages 4489–4497,
-
[35]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video gen- erative models.arXiv preprint arXiv:2503.20314, 2025. 2, 8, 12
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Gpd-vvto: Preserving garment details in video virtual try-on
Yuanbin Wang, Weilun Dai, Long Chan, Huanyu Zhou, Aixi Zhang, and Si Liu. Gpd-vvto: Preserving garment details in video virtual try-on. InProceedings of the 32nd ACM International Conference on Multimedia, pages 7133–7142,
-
[37]
Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004. 13
work page 2004
-
[38]
Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023. 12
work page 2023
-
[39]
Aggregated residual transformations for deep neural networks
Saining Xie, Ross Girshick, Piotr Doll ´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500,
-
[40]
Ootd- iffusion: Outfitting fusion based latent diffusion for control- lable virtual try-on
Yuhao Xu, Tao Gu, Weifeng Chen, and Arlene Chen. Ootd- iffusion: Outfitting fusion based latent diffusion for control- lable virtual try-on. InProceedings of the AAAI Conference on Artificial Intelligence, pages 8996–9004, 2025. 3, 7
work page 2025
-
[41]
Tunnel try-on: Excavating spatial-temporal tunnels for high-quality virtual try-on in videos
Zhengze Xu, Mengting Chen, Zhao Wang, Linyu Xing, Zhonghua Zhai, Nong Sang, Jinsong Lan, Shuai Xiao, and Changxin Gao. Tunnel try-on: Excavating spatial-temporal tunnels for high-quality virtual try-on in videos. InProceed- ings of the 32nd ACM International Conference on Multime- dia, pages 3199–3208, 2024. 2, 12
work page 2024
-
[42]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 12
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[43]
Self-attention generative adversarial networks
Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augus- tus Odena. Self-attention generative adversarial networks. In International conference on machine learning, pages 7354–
-
[44]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 586–595, 2018. 13
work page 2018
-
[45]
Jun Zheng, Jing Wang, Fuwei Zhao, Xujie Zhang, and Xiaodan Liang. Dynamic try-on: Taming video virtual try-on with dynamic attention mechanism.arXiv preprint arXiv:2412.09822, 2024. 12
-
[46]
Mv-ton: Memory-based video virtual try- on network
Xiaojing Zhong, Zhonghua Wu, Taizhe Tan, Guosheng Lin, and Qingyao Wu. Mv-ton: Memory-based video virtual try- on network. InProceedings of the 29th ACM International Conference on Multimedia, pages 908–916, 2021. 7, 12
work page 2021
-
[47]
Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang, 10 Mingyuan Gao, and Xin Dong. Dreamvvt: Mastering realis- tic video virtual try-on in the wild via a stage-wise diffusion transformer framework.arXiv preprint arXiv:2508.02807,
-
[48]
2, 4, 7, 12 11 Appendix A. Related Work Video virtual try-on (VVT) aims to replace a person’s cloth- ing with a target garment while preserving the spatiotem- poral consistency of the video, i.e., the generated results should ensure a consistent appearance of the target garment across frames, align seamlessly with the person’s pose and motion, and maintai...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.