pith. machine review for the scientific record.

arxiv: 2511.18957 · v2 · submitted 2025-11-24 · 💻 cs.CV

Eevee: Towards Close-up High-resolution Video-based Virtual Try-on

Pith reviewed 2026-05-17 06:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords video virtual try-on · close-up video generation · garment texture preservation · virtual try-on dataset · fashion video synthesis · consistency metric · high-resolution garment images

The pith

A new dataset with high-resolution close-up garment images and real-model videos lets existing video models generate virtual try-on results that preserve fabric textures and details far better.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current virtual try-on systems fall short for business needs because they train on single flat garment photos and produce only full-body shots. To fix this, the authors release a dataset that pairs detailed close-up garment photographs and text descriptions with actual full-shot and close-up videos of real people wearing the garments. Experiments show that feeding these richer images into off-the-shelf video generation models allows the models to pull out and apply fine texture information, raising the visual quality of close-up results. A new evaluation score called VGID is introduced to measure how consistently both texture and overall garment structure survive the generation process. If the approach works, marketers could create convincing close-up product videos without repeated physical filming.

Core claim

The central claim is that existing video generation models, when given the detailed images from the introduced dataset, can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. The dataset supplies high-fidelity close-up garment images together with textual descriptions and pairs them with both full-shot and close-up try-on videos captured on real human models. The new VGID metric is defined to quantify preservation of both texture and structure in these videos, and benchmarking shows that prior methods still lose fine garment details, especially in close-up footage.

What carries the argument

The Eevee dataset of paired high-resolution close-up garment images, text descriptions, and real-model full-shot plus close-up videos, together with the VGID metric that scores joint texture and structural consistency.

If this is right

  • Video models conditioned on the dataset's close-up garment images produce virtual try-on footage whose fabric details remain sharper and more consistent across frames.
  • The VGID metric ranks current methods by how well they keep both fine texture and overall garment shape intact in close-up sequences.
  • Business video production gains a practical route to high-fidelity close-up marketing clips without additional live shoots.
  • Benchmark results highlight specific failure modes in texture transfer that future architectures must address.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Retail platforms could integrate the dataset into automated video pipelines to reduce the cost of creating detailed product demonstrations.
  • The same close-up conditioning strategy might transfer to other video synthesis domains that require preserving fine surface details, such as product repair or material simulation clips.
  • Extending the dataset to varied body types, lighting conditions, and garment categories would test how far the texture-extraction benefit generalizes.

Load-bearing premise

The collected close-up videos of real models are representative enough of everyday fashion marketing needs, and the VGID score actually tracks the texture and structure qualities that matter to viewers.

What would settle it

Train several recent video models on the new dataset and compare their close-up output videos with those of models trained on prior single-image datasets. If human raters or independent texture-matching measures find no improvement in texture detail, the central claim fails.

Figures

Figures reproduced from arXiv: 2511.18957 by Dan Song, Dongyang Jin, Jianhao Zeng, Lei Sun, Nannan Zhang, Ruidong Chen, Ryan Xu, Xiangxiang Chu, Xuanpu Zhang, Yancheng Bai.

Figure 1: An illustrative sample from our proposed virtual try-on dataset. Compared to existing datasets, ours provides a richer collection …
Figure 2: A comparison of garment image and video frame, along …
Figure 3: The overall fine-tuning pipeline of the VACE on our dataset. Compared to other video-based virtual try-on models, our VACE …
Figure 4: VBench evaluation results of video-based virtual try-on …
Figure 5: Qualitative comparison of close-up video virtual try-on results on the Eevee dataset. Our fine-tuned VACE model not only …
Figure 6: Qualitative comparison of the ablation study. The model combining LoRA fine-tuning and detailed images generates textures …
Original abstract

Video virtual try-on technology provides a cost-effective solution for creating marketing videos in fashion e-commerce. However, its practical adoption is hindered by two critical limitations. First, the reliance on a single garment image as input in current virtual try-on datasets limits the accurate capture of realistic texture details. Second, most existing methods focus solely on generating full-shot virtual try-on videos, neglecting the business's demand for videos that also provide detailed close-ups. To address these challenges, we introduce a high-resolution dataset for video-based virtual try-on. This dataset offers two key features. First, it provides more detailed information on the garments, which includes high-fidelity images with detailed close-ups and textual descriptions. Second, it uniquely includes full-shot and close-up try-on videos of real human models. Furthermore, accurately assessing consistency becomes significantly more critical for the close-up videos, which demand high-fidelity preservation of garment details. To facilitate such fine-grained evaluation, we propose a new garment consistency metric VGID (Video Garment Inception Distance) that quantifies the preservation of both texture and structure. Our experiments validate these contributions. We demonstrate that by utilizing the detailed images from our dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. Furthermore, we conduct a comprehensive benchmark of recent models. The benchmark effectively identifies the texture and structural preservation problems among current methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a high-resolution dataset for video-based virtual try-on featuring detailed garment images (including close-ups and textual descriptions) and paired full-shot/close-up try-on videos of real models. It proposes the VGID metric to quantify texture and structure preservation in close-up videos and claims that existing video generation models can extract better texture features from the detailed images, yielding more realistic results; a benchmark of recent models is also presented to highlight current limitations in texture and structural fidelity.

Significance. If the central claims hold, the work addresses a practical gap in fashion e-commerce by shifting focus from full-shot to close-up virtual try-on and supplying both richer input data and a specialized consistency metric. The dataset's dual provision of detailed garment captures and corresponding real-model videos is a clear strength that could enable more faithful texture transfer; the emphasis on a business-relevant evaluation setting is also positive. Significance is tempered by the need for stronger evidence that VGID reliably isolates the claimed texture gains.

major comments (2)
  1. [VGID definition and experiments] VGID metric (introduced to support the fine-grained evaluation claim): no human correlation study, no comparison against region-masked LPIPS or DINO features, and no ablation isolating texture fidelity from lighting/pose confounds are reported. Without these, lower VGID scores cannot be confidently attributed to the realism gains from the dataset's detailed close-up images, directly weakening the central experimental claim.
  2. [Experiments and benchmark] Experiments section and benchmark results: the abstract states that detailed images 'significantly enhance' realism and that the benchmark 'identifies' problems, yet no quantitative numbers, error bars, or per-model VGID breakdowns are visible in the provided text. This absence makes it impossible to verify whether the enhancement is load-bearing or merely incremental.
minor comments (2)
  1. [Metric definition] Notation for VGID components (texture vs. structure terms) should be defined explicitly with equations or pseudocode to avoid ambiguity when readers attempt to reproduce the metric.
  2. [Dataset description] Dataset statistics table (if present) would benefit from explicit comparison to prior virtual try-on datasets on resolution, close-up coverage, and garment variety to clarify the claimed novelty.
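
A human correlation study of the kind requested in major comment 1 usually comes down to two numbers: a Pearson coefficient (linear agreement) and a Spearman coefficient (rank agreement) between the metric and the human ratings. The sketch below is a minimal, standard-library illustration of that computation. The function names and the sample values in `vgid_scores` and `human_ratings` are hypothetical and are not taken from the paper; the example only assumes that a lower VGID is meant to be better, so a useful metric should correlate negatively with ratings.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-video values, for illustration only.
vgid_scores = [12.0, 18.5, 25.1, 31.0, 40.2]   # metric under test
human_ratings = [4.6, 4.1, 3.9, 2.8, 2.0]      # mean opinion score per video

print("Pearson :", pearson(vgid_scores, human_ratings))
print("Spearman:", spearman(vgid_scores, human_ratings))
```

Reporting both coefficients is a common choice because Spearman is insensitive to any monotonic rescaling of the metric, while Pearson shows whether the relationship is also roughly linear.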

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your detailed and constructive review. We appreciate the feedback highlighting areas where the manuscript can be strengthened, particularly around the validation of the VGID metric and the presentation of experimental results. We address each major comment below and commit to revisions that will improve clarity and rigor without altering the core contributions.

Point-by-point responses
  1. Referee: [VGID definition and experiments] VGID metric (introduced to support the fine-grained evaluation claim): no human correlation study, no comparison against region-masked LPIPS or DINO features, and no ablation isolating texture fidelity from lighting/pose confounds are reported. Without these, lower VGID scores cannot be confidently attributed to the realism gains from the dataset's detailed close-up images, directly weakening the central experimental claim.

    Authors: We agree that further validation of VGID would strengthen the central claim. In the revised manuscript we will add a human correlation study in which participants rate texture and structure preservation on a subset of generated videos, and we will report Pearson/Spearman correlations with VGID. We will also include direct comparisons of VGID against region-masked LPIPS and DINO features on the same evaluation set. Finally, we will perform and report controlled ablations that fix lighting and pose while varying only garment texture detail, thereby isolating the contribution of the close-up garment images. These additions will be placed in a new subsection of the experiments. revision: yes

  2. Referee: [Experiments and benchmark] Experiments section and benchmark results: the abstract states that detailed images 'significantly enhance' realism and that the benchmark 'identifies' problems, yet no quantitative numbers, error bars, or per-model VGID breakdowns are visible in the provided text. This absence makes it impossible to verify whether the enhancement is load-bearing or merely incremental.

    Authors: We apologize that the quantitative evidence was not sufficiently explicit in the narrative. The manuscript already contains VGID scores, per-model comparisons, and benchmark tables in Section 4 and the associated figures. To make these results immediately verifiable, we will expand the experiments section with explicit numerical values, standard-error bars, and a consolidated table that reports VGID (texture and structure components) for every baseline on both full-shot and close-up videos. We will also add a short paragraph quantifying the improvement magnitude (e.g., relative VGID reduction) when detailed garment images are used versus single-image baselines. revision: yes
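
The two quantities promised in the second response, standard-error bars and a relative VGID reduction, can be computed as in the minimal sketch below. It uses only the standard library. The helper names and the score lists are hypothetical placeholders, not values from the paper, and the sketch assumes that each list holds one VGID score per test video for a single model configuration.

```python
import math


def mean_and_sem(scores):
    """Return (mean, standard error of the mean) using the sample variance."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


def relative_reduction(baseline, improved):
    """Fractional drop from baseline to improved (positive means lower score)."""
    return (baseline - improved) / baseline


# Hypothetical per-video VGID scores, for illustration only.
single_image_scores = [30.0, 32.0, 34.0]     # single garment image as input
detailed_image_scores = [24.0, 25.0, 26.0]   # detailed close-up images as input

baseline_mean, baseline_sem = mean_and_sem(single_image_scores)
detailed_mean, detailed_sem = mean_and_sem(detailed_image_scores)
reduction = relative_reduction(baseline_mean, detailed_mean)

print(f"single image   : {baseline_mean:.2f} +/- {baseline_sem:.2f}")
print(f"detailed images: {detailed_mean:.2f} +/- {detailed_sem:.2f}")
print(f"relative VGID reduction: {reduction:.1%}")
```

A relative reduction is easier to compare across full-shot and close-up settings than a raw difference, since the absolute scale of a distance-based score can differ between the two.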

Circularity Check

0 steps flagged

No circularity: new dataset and metric introduced as independent contributions

Full rationale

The paper's core contributions are the creation of a new high-resolution video try-on dataset containing detailed garment close-ups and paired full/close-up videos, plus the definition of a new VGID metric for evaluating texture and structure preservation. These are presented as empirical resources rather than derived quantities. The claim that detailed images improve texture extraction in existing models is supported by experiments on the new data, not by fitting parameters to a target result or reducing via self-referential equations. No load-bearing step equates a prediction to its own input by construction, and no uniqueness theorem or ansatz is smuggled through self-citation. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard computer-vision assumptions about feature extraction and metric validity rather than new free parameters or invented entities.

axioms (1)
  • domain assumption: Inception-style features can quantify garment texture and structure preservation in video frames.
    Invoked when defining the VGID metric for close-up consistency evaluation.
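
This page does not give the VGID formula. To make the axiom concrete, the sketch below shows the generic Fréchet distance that FID-style "Inception distance" metrics compute between two sets of feature vectors. Treating VGID as a member of that family is an assumption based on its name, not the authors' definition, and the function name `frechet_distance` and the synthetic features are illustrative. In practice the rows would be features extracted from garment regions of reference and generated frames.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim).
    """
    a = np.asarray(feats_a, dtype=np.float64)
    b = np.asarray(feats_b, dtype=np.float64)
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    # Tr(sqrt(cov_a @ cov_b)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # positive semi-definite inputs; clip tiny negative round-off.
    eigenvalues = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigenvalues.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Synthetic stand-ins for extracted features (assumption: 4-dim for brevity).
rng = np.random.default_rng(0)
reference_feats = rng.normal(size=(200, 4))
shifted_feats = reference_feats + 3.0  # same spread, different mean

print("identical sets:", frechet_distance(reference_feats, reference_feats))
print("shifted set   :", frechet_distance(reference_feats, shifted_feats))
```

Because the distance depends only on the mean and covariance of the features, it inherits whatever the feature extractor is or is not sensitive to, which is why the axiom above is load-bearing for any metric of this kind.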

pith-pipeline@v0.9.0 · 5581 in / 1211 out tokens · 80553 ms · 2026-05-17T06:30:11.874922+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GS-STVSR: Ultra-Efficient Continuous Spatio-Temporal Video Super-Resolution via 2D Gaussian Splatting

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    GS-STVSR achieves state-of-the-art continuous spatio-temporal video super-resolution quality with nearly constant inference time at standard scales and over 3x speedup at extreme scales using 2D Gaussian Splatting.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    Mc-llava: Multi-concept personalized vision-language model.arXiv preprint arXiv:2411.11706, 2024

    Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, et al. Mc-llava: Multi-concept personalized vision-language model.arXiv preprint arXiv:2411.11706, 2024. 3

  2. [2]

    Unictokens: Boosting personalized understand- ing and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025

    Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understand- ing and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025. 3

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 4

  4. [4]

    Openpose: Realtime multi-person 2d pose estimation using part affinity fields.IEEE transactions on pattern analysis and machine intelligence, 43(1):172–186,

    Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields.IEEE transactions on pattern analysis and machine intelligence, 43(1):172–186,

  5. [5]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Inpro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. 6

  6. [6]

    Anyscene: Customized image synthe- sis with composited foreground

    Ruidong Chen, Lanjun Wang, Weizhi Nie, Yongdong Zhang, and An-An Liu. Anyscene: Customized image synthe- sis with composited foreground. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8724–8733, 2024. 3

  7. [7]

    Viton-hd: High-resolution virtual try-on via misalignment-aware normalization

    Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14131–14140, 2021. 3

  8. [8]

    Improving diffusion models for au- thentic virtual try-on in the wild

    Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for au- thentic virtual try-on in the wild. InEuropean Conference on Computer Vision, pages 206–235. Springer, 2024. 3

  9. [9]

    Catv2ton: Taming diffusion transformers for vision-based virtual try-on with temporal concatenation.arXiv preprint arXiv:2501.11325, 2025

    Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, and Xiaodan Liang. Catv2ton: Taming diffusion transformers for vision-based virtual try-on with temporal concatenation. arXiv preprint arXiv:2501.11325, 2025. 2, 3

  10. [10]

    Twins: Revisiting the design of spatial attention in vision transformers.Advances in neural information processing systems, 34:9355–9366, 2021

    Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haib- ing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers.Advances in neural information processing systems, 34:9355–9366, 2021. 3

  11. [11]

    Visionllama: A unified llama backbone for vision tasks

    Xiangxiang Chu, Jianlin Su, Bo Zhang, and Chunhua Shen. Visionllama: A unified llama backbone for vision tasks. InEuropean Conference on Computer Vision, pages 1–18. Springer, 2024. 3

  12. [12]

    Usp: Unified self-supervised pretraining for image generation and under- standing.ICCV, 2025

    Xiangxiang Chu, Renda Li, and Yong Wang. Usp: Unified self-supervised pretraining for image generation and under- standing.ICCV, 2025. 3

  13. [13]

    Towards multi-pose guided virtual try-on network

    Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin. Towards multi-pose guided virtual try-on network. InProceedings of the IEEE/CVF international conference on computer vision, pages 9026–9035, 2019. 3

  14. [14]

    Fw-gan: Flow-navigated warping gan for video virtual try-on

    Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bowen Wu, Bing-Cheng Chen, and Jian Yin. Fw-gan: Flow-navigated warping gan for video virtual try-on. InProceedings of the IEEE/CVF international conference on computer vision, pages 1161–1170, 2019. 2, 3

  15. [15]

    Vivid: Video virtual try-on using diffusion models.arXiv preprint arXiv:2405.11794, 2024

    Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, and Zheng- Jun Zha. Vivid: Video virtual try-on using diffusion models. arXiv preprint arXiv:2405.11794, 2024. 2, 3, 6, 7

  16. [16]

    Taming the power of diffusion models for high-quality virtual try-on with appearance flow

    Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang. Taming the power of diffusion models for high-quality virtual try-on with appearance flow. InProceedings of the 31st ACM International Conference on Multimedia, pages 7599–7607, 2023. 3

  17. [17]

    Densepose: Dense human pose estimation in the wild

    Rıza Alp G ¨uler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7297–7306, 2018. 4

  18. [18]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 3

  19. [19]

    Viton: An image-based virtual try-on network

    Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7543–7552, 2018. 3

  20. [20]

    Can spatiotemporal 3d cnns retrace the history of 2d cnns and im- agenet? InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018

    Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and im- agenet? InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018. 6

  21. [21]

    Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 3

  22. [22]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 6, 7

  23. [23]

    Vbench: Comprehensive bench- mark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 2, 5, 6, 7

  24. [24]

    Doubleu-net: A deep convolutional neural network for medical image segmen- tation

    Debesh Jha, Michael A Riegler, Dag Johansen, P ˚al Halvorsen, and H ˚avard D Johansen. Doubleu-net: A deep convolutional neural network for medical image segmen- tation. In2020 IEEE 33rd International symposium on computer-based medical systems (CBMS), pages 558–564. IEEE, 2020. 3

  25. [25]

    Cloth- former: Taming video virtual try-on in all module

    Jianbin Jiang, Tan Wang, He Yan, and Junhui Liu. Cloth- former: Taming video virtual try-on in all module. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10799–10808, 2022. 2

  26. [26]

    VACE: All-in-One Video Creation and Editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing.arXiv preprint arXiv:2503.07598, 2025. 2, 3, 4, 6, 7

  27. [27]

    Stableviton: Learning semantic correspon- dence with latent diffusion model for virtual try-on

    Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learning semantic correspon- dence with latent diffusion model for virtual try-on. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8176–8185, 2024. 3

  28. [28]

    Flux-text: A simple and advanced diffusion transformer baseline for scene text editing.arXiv preprint arXiv:2505.03329, 2025

    Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Lei Sun, and Xiangxiang Chu. Flux-text: A simple and advanced diffusion transformer baseline for scene text editing.arXiv preprint arXiv:2505.03329, 2025. 3

  29. [29]

    Pursuing temporal-consistent video virtual try-on via dynamic pose in- teraction

    Dong Li, Wenqi Zhong, Wei Yu, Yingwei Pan, Dingwen Zhang, Ting Yao, Junwei Han, and Tao Mei. Pursuing temporal-consistent video virtual try-on via dynamic pose in- teraction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22648–22657, 2025. 2

  30. [30]

    Magictryon: Harnessing diffusion transformer for garment-preserving video virtual try-on, 2025

    Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, and Peng-Tao Jiang. Magictryon: Harnessing diffusion transformer for garment-preserving video virtual try-on, 2025. 2, 3, 4, 6, 7

  31. [31]

    Realvvt: Towards photorealistic video virtual try-on via spatio-temporal consistency.arXiv preprint arXiv:2501.08682, 2025

    Siqi Li, Zhengkai Jiang, Jiawei Zhou, Zhihong Liu, Xiaowei Chi, and Haoqian Wang. Realvvt: Towards photorealistic video virtual try-on via spatio-temporal consistency.arXiv preprint arXiv:2501.08682, 2025. 2, 3

  32. [32]

    Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want

    Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271, 2024. 3

  33. [33]

    Perceive anything: Recognize, explain, caption, and segment anything in images and videos, 2025

    Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos, 2025. 3

  34. [34]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 3

  35. [35]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 4

  36. [36]

    Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

    Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 1096–1104, 2016. 3

  37. [37]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

  38. [38]

    Llm as dataset ana- lyst: Subpopulation structure discovery with large language model

    Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Ji- aming Liu, and Shanghang Zhang. Llm as dataset ana- lyst: Subpopulation structure discovery with large language model. InEuropean Conference on Computer Vision, pages 235–252. Springer, 2024. 3

  39. [39]

    Dress code: High- resolution multi-category virtual try-on

    Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress code: High- resolution multi-category virtual try-on. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2231–2235, 2022. 3

  40. [40]

    Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on

    Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Mar- cella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. In Proceedings of the 31st ACM international conference on multimedia, pages 8580–8589, 2023. 3

  41. [41]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 5

  42. [42]

    Anilines - anime lineart extractor.https: //github.com/zhenglinpan/AniLines-Anime- Lineart-Extractor, 2025

    Zhenglin Pan. Anilines - anime lineart extractor.https: //github.com/zhenglinpan/AniLines-Anime- Lineart-Extractor, 2025. 4

  43. [43]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

  44. [44]

    Towards real- istic data generation for real-world super-resolution.arXiv preprint arXiv:2406.07255, 2024

    Long Peng, Wenbo Li, Renjing Pei, Jingjing Ren, Jiaqi Xu, Yang Wang, Yang Cao, and Zheng-Jun Zha. Towards real- istic data generation for real-world super-resolution.arXiv preprint arXiv:2406.07255, 2024. 3

  45. [45]

    Pixel to gaussian: Ultra-fast continuous super-resolution with 2d gaussian modeling.arXiv preprint arXiv:2503.06617, 2025

    Long Peng, Anran Wu, Wenbo Li, Peizhe Xia, Xueyuan Dai, Xinjie Zhang, Xin Di, Haoze Sun, Renjing Pei, Yang Wang, et al. Pixel to gaussian: Ultra-fast continuous super-resolution with 2d gaussian modeling.arXiv preprint arXiv:2503.06617, 2025. 3

  46. [46]

    Sam 2: Segment anything in images and videos,

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos,

  47. [47]

    Grounded sam: Assembling open-world models for diverse visual tasks,

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks,

  48. [48]

    Fashion customization: Image generation based on editing clue.IEEE Transactions on Circuits and Systems for Video Technology, 34(6):4434–4444, 2023

    Dan Song, Jian-Hao Zeng, Min Liu, Xuan-Ya Li, and An- An Liu. Fashion customization: Image generation based on editing clue.IEEE Transactions on Circuits and Systems for Video Technology, 34(6):4434–4444, 2023. 2, 3

  49. [49]

    Mef-gd: Multimodal enhancement and fusion network for garment designer. IEEE Transactions on Circuits and Systems for Video Technology, 2025

    Dan Song, Juan Zhou, Jianhao Zeng, HongShuo Tian, Bolun Zheng, Rongbao Kang, and An-An Liu. Mef-gd: Multimodal enhancement and fusion network for garment designer. IEEE Transactions on Circuits and Systems for Video Technology, 2025. 2

  50. [50]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 3

  51. [51]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 2, 5, 6

  52. [52]

    Attention is all you need. Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 3

  53. [53]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

  54. [54]

    Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 2, 6

  55. [55]

    Detectron2.https://github

    Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. 4

  56. [56]

    Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning

    Zhenyu Xie, Zaiyu Huang, Xin Dong, Fuwei Zhao, Haoye Dong, Xijin Zhang, Feida Zhu, and Xiaodan Liang. Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23550–23559, 2023. 3

  57. [57]

    Scalar: Scale-wise controllable visual autoregressive learning. arXiv preprint arXiv:2507.19946, 2025

    Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, and Xiangxiang Chu. Scalar: Scale-wise controllable visual autoregressive learning. arXiv preprint arXiv:2507.19946, 2025. 3

  58. [58]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 3

  59. [59]

    Cat-dm: Controllable accelerated virtual try-on with diffusion model

    Jianhao Zeng, Dan Song, Weizhi Nie, Hongshuo Tian, Tongtong Wang, and An-An Liu. Cat-dm: Controllable accelerated virtual try-on with diffusion model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8372–8382, 2024. 2, 3

  60. [60]

    Robust-mvton: Learning cross-pose feature alignment and fusion for robust multi-view virtual try-on

    Nannan Zhang, Yijiang Li, Dong Du, Zheng Chong, Zhengwentai Sun, Jianhao Zeng, Yusheng Dai, Zhengyu Xie, Hairui Zhu, and Xiaoguang Han. Robust-mvton: Learning cross-pose feature alignment and fusion for robust multi-view virtual try-on. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16029–16039,

  61. [61]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 2, 6, 7

  62. [62]

    Viton-dit: Learning in-the-wild video try-on from human dance videos via diffusion transformers. arXiv preprint arXiv:2405.18326, 2024

    Jun Zheng, Fuwei Zhao, Youjiang Xu, Xin Dong, and Xiaodan Liang. Viton-dit: Learning in-the-wild video try-on from human dance videos via diffusion transformers. arXiv preprint arXiv:2405.18326, 2024. 2

  63. [63]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024. 3

  64. [64]

    Propainter: Improving propagation and transformer for video inpainting

    Shangchen Zhou, Chongyi Li, Kelvin CK Chan, and Chen Change Loy. Propainter: Improving propagation and transformer for video inpainting. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10477–10486, 2023. 3

  65. [65]

    Dreamvvt: Mastering realistic video virtual try-on in the wild via a stage-wise diffusion transformer framework. arXiv preprint arXiv:2508.02807,

    Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang, Mingyuan Gao, and Xin Dong. Dreamvvt: Mastering realistic video virtual try-on in the wild via a stage-wise diffusion transformer framework. arXiv preprint arXiv:2508.02807,