arxiv: 2511.18957 · v2 · submitted 2025-11-24 · 💻 cs.CV

Eevee: Towards Close-up High-resolution Video-based Virtual Try-on

Jianhao Zeng, Yancheng Bai, Ruidong Chen, Xuanpu Zhang, Lei Sun, Dongyang Jin, Ryan Xu, Nannan Zhang, Dan Song, Xiangxiang Chu

Pith reviewed 2026-05-17 06:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords video virtual try-on · close-up video generation · garment texture preservation · virtual try-on dataset · fashion video synthesis · consistency metric · high-resolution garment images

The pith

A new dataset with high-resolution close-up garment images and real-model videos lets existing video models generate virtual try-on results that preserve fabric textures and fine details better than models conditioned on a single flat garment photo.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current virtual try-on systems fall short for business needs because they train on single flat garment photos and produce only full-body shots. To fix this, the authors release a dataset that pairs detailed close-up garment photographs and text descriptions with actual full-shot and close-up videos of real people wearing the garments. Experiments show that feeding these richer images into off-the-shelf video generation models allows the models to pull out and apply fine texture information, raising the visual quality of close-up results. A new evaluation score called VGID is introduced to measure how consistently both texture and overall garment structure survive the generation process. If the approach works, marketers could create convincing close-up product videos without repeated physical filming.

Core claim

The central claim is that by utilizing the detailed images from the introduced dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. The dataset supplies high-fidelity close-up garment images together with textual descriptions and pairs them with both full-shot and close-up try-on videos captured on real human models. The new VGID metric is defined to quantify preservation of both texture and structure in these videos, and benchmarking shows that prior methods still lose fine garment details especially in close-up footage.

What carries the argument

The Eevee dataset of paired high-resolution close-up garment images, text descriptions, and real-model full-shot plus close-up videos, together with the VGID metric that scores joint texture and structural consistency.

If this is right

Video models conditioned on the dataset's close-up garment images produce virtual try-on footage whose fabric details remain sharper and more consistent across frames.
The VGID metric ranks current methods by how well they keep both fine texture and overall garment shape intact in close-up sequences.
Business video production gains a practical route to high-fidelity close-up marketing clips without additional live shoots.
Benchmark results highlight specific failure modes in texture transfer that future architectures must address.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Retail platforms could integrate the dataset into automated video pipelines to reduce the cost of creating detailed product demonstrations.
The same close-up conditioning strategy might transfer to other video synthesis domains that require preserving fine surface details, such as product repair or material simulation clips.
Extending the dataset to varied body types, lighting conditions, and garment categories would test how far the texture-extraction benefit generalizes.

Load-bearing premise

The collected close-up videos of real models are representative enough of everyday fashion marketing needs and the VGID score actually tracks the texture and structure qualities that matter to viewers.

What would settle it

Train several recent video models on the new dataset and compare their close-up outputs against the same models trained on prior single-image datasets. If human raters or independent texture-matching measures find no improvement in texture detail, the central claim fails.

Figures

Figures reproduced from arXiv: 2511.18957 by Dan Song, Dongyang Jin, Jianhao Zeng, Lei Sun, Nannan Zhang, Ruidong Chen, Ryan Xu, Xiangxiang Chu, Xuanpu Zhang, Yancheng Bai.

**Figure 1.** An illustrative sample from our proposed virtual try-on dataset. Compared to existing datasets, ours provides a richer collection …

**Figure 2.** A comparison of garment image and video frame, along …

**Figure 3.** The overall fine-tuning pipeline of the VACE on our dataset. Compared to other video-based virtual try-on models, our VACE …

**Figure 4.** VBench evaluation results of video-based virtual try-on

**Figure 5.** Qualitative comparison of close-up video virtual try-on results on the Eevee dataset. Our fine-tuned VACE model not only …

**Figure 6.** Qualitative comparison of the ablation study. The model combining LoRA fine-tuning and detailed images generates textures …

Original abstract

Video virtual try-on technology provides a cost-effective solution for creating marketing videos in fashion e-commerce. However, its practical adoption is hindered by two critical limitations. First, the reliance on a single garment image as input in current virtual try-on datasets limits the accurate capture of realistic texture details. Second, most existing methods focus solely on generating full-shot virtual try-on videos, neglecting the business's demand for videos that also provide detailed close-ups. To address these challenges, we introduce a high-resolution dataset for video-based virtual try-on. This dataset offers two key features. First, it provides more detailed information on the garments, which includes high-fidelity images with detailed close-ups and textual descriptions; Second, it uniquely includes full-shot and close-up try-on videos of real human models. Furthermore, accurately assessing consistency becomes significantly more critical for the close-up videos, which demand high-fidelity preservation of garment details. To facilitate such fine-grained evaluation, we propose a new garment consistency metric VGID (Video Garment Inception Distance) that quantifies the preservation of both texture and structure. Our experiments validate these contributions. We demonstrate that by utilizing the detailed images from our dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. Furthermore, we conduct a comprehensive benchmark of recent models. The benchmark effectively identifies the texture and structural preservation problems among current methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper adds a practical dataset with close-up garment details and paired videos plus a new VGID metric, but the metric lacks clear validation and the results stay light on numbers.

The main point is that this paper supplies a new high-resolution dataset for video virtual try-on that includes detailed close-up garment images with text descriptions, plus real-model videos in both full and close-up views, and it proposes VGID as a metric for texture and structure consistency in those close-ups. That combination targets a real limitation in current work, where single garment photos lose fine details and methods ignore the close-up shots that matter for e-commerce marketing videos. The experiments show existing video models can pull better texture features from the richer inputs and produce more realistic outputs, and the benchmark flags specific weaknesses in recent methods on preservation tasks. Those are concrete steps forward for an applied corner of the field.

The dataset and the paired video setup stand out as the strongest parts because they directly match business needs rather than just adding another synthetic benchmark. The VGID metric is a reasonable attempt to handle the higher bar for close-up evaluation.

The soft spots sit mainly on the evaluation side. The abstract claims the experiments validate the contributions and that the benchmark identifies problems, yet it gives no quantitative scores, ablations, or breakdowns. More importantly, VGID is introduced to measure what matters in close-ups, but there is no reported check against human judgments or tests against simpler alternatives such as masked LPIPS on garment regions. Without that link, it is hard to know whether changes in VGID scores actually track the claimed realism gains or just reflect other factors like lighting. If the full paper includes data release details and those missing comparisons, the case strengthens.

This work is aimed at people building or evaluating virtual try-on tools for fashion e-commerce rather than readers chasing broad theoretical advances in video generation. A researcher who needs new data resources or wants to test methods on close-up fidelity would find value here, assuming the dataset ships with the paper. It deserves a serious referee because the practical gap it fills is clear and the subfield is active enough to benefit from new resources, even if the metric side needs tightening. I would send it for peer review with a request for quantitative tables and some human correlation on VGID.
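
The human-correlation check requested in the letter could be run along the following lines. This is a minimal sketch, not the paper's protocol: it assumes one automatic score and one averaged human rating per generated video, the function names are hypothetical, and the numbers are made-up placeholders rather than results.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for illustration only (not taken from the paper).
metric_scores = [0.91, 0.85, 0.78, 0.60, 0.55]
human_ratings = [4.5, 4.6, 3.9, 2.8, 2.1]

print("Pearson :", pearson(metric_scores, human_ratings))
print("Spearman:", spearman(metric_scores, human_ratings))
```

Spearman is usually the more informative of the two here, since it only asks whether the metric orders videos the way raters do and does not assume a linear relation between the two scales.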

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a high-resolution dataset for video-based virtual try-on featuring detailed garment images (including close-ups and textual descriptions) and paired full-shot/close-up try-on videos of real models. It proposes the VGID metric to quantify texture and structure preservation in close-up videos and claims that existing video generation models can extract better texture features from the detailed images, yielding more realistic results; a benchmark of recent models is also presented to highlight current limitations in texture and structural fidelity.

Significance. If the central claims hold, the work addresses a practical gap in fashion e-commerce by shifting focus from full-shot to close-up virtual try-on and supplying both richer input data and a specialized consistency metric. The dataset's dual provision of detailed garment captures and corresponding real-model videos is a clear strength that could enable more faithful texture transfer; the emphasis on a business-relevant evaluation setting is also positive. Significance is tempered by the need for stronger evidence that VGID reliably isolates the claimed texture gains.

major comments (2)

[VGID definition and experiments] VGID metric (introduced to support the fine-grained evaluation claim): no human correlation study, no comparison against region-masked LPIPS or DINO features, and no ablation isolating texture fidelity from lighting/pose confounds are reported. Without these, lower VGID scores cannot be confidently attributed to the realism gains from the dataset's detailed close-up images, directly weakening the central experimental claim.
[Experiments and benchmark] Experiments section and benchmark results: the abstract states that detailed images 'significantly enhance' realism and that the benchmark 'identifies' problems, yet no quantitative numbers, error bars, or per-model VGID breakdowns are visible in the provided text. This absence makes it impossible to verify whether the enhancement is load-bearing or merely incremental.
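
The region-masked baseline named in the first major comment comes down to averaging a per-pixel distance map over garment pixels only. The sketch below shows that masking step under stated assumptions: the absolute-difference map is a stand-in for a perceptual distance (it is not LPIPS), the garment mask is assumed to be given, and all names and values are hypothetical.

```python
def abs_diff_map(reference, generated):
    """Stand-in per-pixel distance: absolute difference of grayscale values."""
    return [
        [abs(p - q) for p, q in zip(ref_row, gen_row)]
        for ref_row, gen_row in zip(reference, generated)
    ]


def masked_mean(distance_map, mask):
    """Average a distance map over the pixels where the mask is truthy."""
    total, count = 0.0, 0
    for dist_row, mask_row in zip(distance_map, mask):
        for dist, inside in zip(dist_row, mask_row):
            if inside:
                total += dist
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count


# Toy 2x3 grayscale frames; the first column is background, the rest garment.
reference = [[0.2, 0.8, 0.8],
             [0.2, 0.8, 0.8]]
generated = [[0.9, 0.7, 0.8],
             [0.9, 0.8, 0.6]]
garment_mask = [[0, 1, 1],
                [0, 1, 1]]
full_mask = [[1, 1, 1],
             [1, 1, 1]]

distances = abs_diff_map(reference, generated)
print("garment only:", masked_mean(distances, garment_mask))
print("whole frame :", masked_mean(distances, full_mask))
```

In the toy frames the background changes a lot while the garment barely does, so the whole-frame average overstates the garment error. That is the confound the referee wants separated from texture fidelity.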

minor comments (2)

[Metric definition] Notation for VGID components (texture vs. structure terms) should be defined explicitly with equations or pseudocode to avoid ambiguity when readers attempt to reproduce the metric.
[Dataset description] Dataset statistics table (if present) would benefit from explicit comparison to prior virtual try-on datasets on resolution, close-up coverage, and garment variety to clarify the claimed novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your detailed and constructive review. We appreciate the feedback highlighting areas where the manuscript can be strengthened, particularly around the validation of the VGID metric and the presentation of experimental results. We address each major comment below and commit to revisions that will improve clarity and rigor without altering the core contributions.

Referee: [VGID definition and experiments] VGID metric (introduced to support the fine-grained evaluation claim): no human correlation study, no comparison against region-masked LPIPS or DINO features, and no ablation isolating texture fidelity from lighting/pose confounds are reported. Without these, lower VGID scores cannot be confidently attributed to the realism gains from the dataset's detailed close-up images, directly weakening the central experimental claim.

Authors: We agree that further validation of VGID would strengthen the central claim. In the revised manuscript we will add a human correlation study in which participants rate texture and structure preservation on a subset of generated videos, and we will report Pearson and Spearman correlations with VGID. We will also include direct comparisons of VGID against region-masked LPIPS and DINO features on the same evaluation set. Finally, we will perform and report controlled ablations that fix lighting and pose while varying only garment texture detail, thereby isolating the contribution of the close-up garment images. These additions will be placed in a new subsection of the experiments. revision: yes
Referee: [Experiments and benchmark] Experiments section and benchmark results: the abstract states that detailed images 'significantly enhance' realism and that the benchmark 'identifies' problems, yet no quantitative numbers, error bars, or per-model VGID breakdowns are visible in the provided text. This absence makes it impossible to verify whether the enhancement is load-bearing or merely incremental.

Authors: We apologize that the quantitative evidence was not sufficiently explicit in the narrative. The manuscript already contains VGID scores, per-model comparisons, and benchmark tables in Section 4 and the associated figures. To make these results immediately verifiable, we will expand the experiments section with explicit numerical values, standard-error bars, and a consolidated table that reports VGID (texture and structure components) for every baseline on both full-shot and close-up videos. We will also add a short paragraph quantifying the improvement magnitude (e.g., relative VGID reduction) when detailed garment images are used versus single-image baselines. revision: yes
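
The reporting promised in the second response (a mean with a standard-error bar per configuration, plus a relative change against the single-image baseline) can be sketched as follows. The per-video scores are made-up placeholders and the helper names are hypothetical. Whether an improvement appears as an increase or a reduction depends on how the paper orients the score, which this page does not settle.

```python
from math import sqrt
from statistics import mean, stdev


def standard_error(scores):
    """Standard error of the mean, using the sample standard deviation."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    return stdev(scores) / sqrt(len(scores))


def relative_change(baseline_mean, candidate_mean):
    """Signed change of the candidate relative to the baseline."""
    if baseline_mean == 0:
        raise ValueError("baseline mean must be non-zero")
    return (candidate_mean - baseline_mean) / abs(baseline_mean)


# Placeholder per-video scores for illustration only (not from the paper).
single_image_scores = [0.62, 0.58, 0.65, 0.60]
detailed_image_scores = [0.71, 0.69, 0.74, 0.70]

for name, scores in [("single image", single_image_scores),
                     ("detailed images", detailed_image_scores)]:
    print(f"{name}: {mean(scores):.3f} +/- {standard_error(scores):.3f}")

change = relative_change(mean(single_image_scores), mean(detailed_image_scores))
print(f"relative change: {change:+.1%}")
```

Reporting the signed change together with both standard errors lets a reader judge whether the gap is large compared with the video-to-video spread.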

Circularity Check

0 steps flagged

No circularity: new dataset and metric introduced as independent contributions

The paper's core contributions are the creation of a new high-resolution video try-on dataset containing detailed garment close-ups and paired full/close-up videos, plus the definition of a new VGID metric for evaluating texture and structure preservation. These are presented as empirical resources rather than derived quantities. The claim that detailed images improve texture extraction in existing models is supported by experiments on the new data, not by fitting parameters to a target result or reducing via self-referential equations. No load-bearing step equates a prediction to its own input by construction, and no uniqueness theorem or ansatz is smuggled through self-citation. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard computer-vision assumptions about feature extraction and metric validity rather than new free parameters or invented entities.

axioms (1)

domain assumption: Inception-style features can quantify garment texture and structure preservation in video frames.
Invoked when defining the VGID metric for close-up consistency evaluation.
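
To make this assumption concrete: the VGID expression quoted further down this page has the form of a cosine similarity between globally average-pooled feature maps. The sketch below implements only that quoted form. The feature extractor, the processing that produces the primed features F′, and any aggregation over frames are not specified on this page, so the inputs are plain nested lists and every function name is hypothetical.

```python
from math import sqrt


def global_average_pool(feature_map):
    """Collapse a C x H x W feature map (nested lists) to a length-C vector."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def gap_cosine(features_s, features_v):
    """Cosine similarity of the pooled source and video-frame features."""
    return cosine_similarity(global_average_pool(features_s),
                             global_average_pool(features_v))


# Toy 2-channel, 2x2 feature maps (illustrative values only).
textured = [[[1.0, 3.0], [2.0, 2.0]],
            [[0.0, 4.0], [4.0, 0.0]]]
flat = [[[4.0, 4.0], [4.0, 4.0]],
        [[4.0, 4.0], [4.0, 4.0]]]

print(gap_cosine(textured, flat))
```

Two properties of this form matter when reading scores. As quoted it is a similarity, so larger values mean closer features, and this page does not say whether the paper reports it directly or converts it to a distance. Global average pooling also discards spatial layout and the cosine ignores overall scale, which is why the two toy maps above pool to parallel vectors despite looking very different. That is a property of these operations in general, not a claim about how the paper builds F′.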

pith-pipeline@v0.9.0 · 5581 in / 1211 out tokens · 80553 ms · 2026-05-17T06:30:11.874922+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose a new garment consistency metric VGID (Video Garment Inception Distance) that quantifies the preservation of both texture and structure... $\mathrm{VGID}(I_s, I_v) = \dfrac{\mathrm{GAP}(F'_s) \cdot \mathrm{GAP}(F'_v)}{\lVert \mathrm{GAP}(F'_s) \rVert \, \lVert \mathrm{GAP}(F'_v) \rVert}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce Eevee, a new high-resolution dataset... first dataset to provide both full-shot and close-up videos, and corresponding detailed close-up images

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GS-STVSR: Ultra-Efficient Continuous Spatio-Temporal Video Super-Resolution via 2D Gaussian Splatting
cs.CV 2026-04 unverdicted novelty 7.0

GS-STVSR achieves state-of-the-art continuous spatio-temporal video super-resolution quality with nearly constant inference time at standard scales and over 3x speedup at extreme scales using 2D Gaussian Splatting.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 1 Pith paper · 12 internal anchors

[1]

Mc-llava: Multi-concept personalized vision-language model.arXiv preprint arXiv:2411.11706, 2024

Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, et al. Mc-llava: Multi-concept personalized vision-language model.arXiv preprint arXiv:2411.11706, 2024. 3

work page arXiv 2024
[2]

Unictokens: Boosting personalized understand- ing and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025

Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understand- ing and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025. 3

work page arXiv 2025
[3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Openpose: Realtime multi-person 2d pose estimation using part affinity fields.IEEE transactions on pattern analysis and machine intelligence, 43(1):172–186,

Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields.IEEE transactions on pattern analysis and machine intelligence, 43(1):172–186,

work page
[5]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Inpro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. 6

work page 2017
[6]

Anyscene: Customized image synthe- sis with composited foreground

Ruidong Chen, Lanjun Wang, Weizhi Nie, Yongdong Zhang, and An-An Liu. Anyscene: Customized image synthe- sis with composited foreground. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8724–8733, 2024. 3

work page 2024
[7]

Viton-hd: High-resolution virtual try-on via misalignment-aware normalization

Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14131–14140, 2021. 3

work page 2021
[8]

Improving diffusion models for au- thentic virtual try-on in the wild

Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for au- thentic virtual try-on in the wild. InEuropean Conference on Computer Vision, pages 206–235. Springer, 2024. 3

work page 2024
[9]

Catv2ton: Taming diffusion transformers for vision-based virtual try-on with temporal concatenation.arXiv preprint arXiv:2501.11325, 2025

Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, and Xiaodan Liang. Catv2ton: Taming diffusion transformers for vision-based virtual try-on with temporal concatenation. arXiv preprint arXiv:2501.11325, 2025. 2, 3

work page arXiv 2025
[10]

Twins: Revisiting the design of spatial attention in vision transformers.Advances in neural information processing systems, 34:9355–9366, 2021

Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haib- ing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers.Advances in neural information processing systems, 34:9355–9366, 2021. 3

work page 2021
[11]

Visionllama: A unified llama backbone for vision tasks

Xiangxiang Chu, Jianlin Su, Bo Zhang, and Chunhua Shen. Visionllama: A unified llama backbone for vision tasks. InEuropean Conference on Computer Vision, pages 1–18. Springer, 2024. 3

work page 2024
[12]

Usp: Unified self-supervised pretraining for image generation and under- standing.ICCV, 2025

Xiangxiang Chu, Renda Li, and Yong Wang. Usp: Unified self-supervised pretraining for image generation and under- standing.ICCV, 2025. 3

work page 2025
[13]

Towards multi-pose guided virtual try-on network

Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin. Towards multi-pose guided virtual try-on network. InProceedings of the IEEE/CVF international conference on computer vision, pages 9026–9035, 2019. 3

work page 2019
[14]

Fw-gan: Flow-navigated warping gan for video virtual try-on

Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bowen Wu, Bing-Cheng Chen, and Jian Yin. Fw-gan: Flow-navigated warping gan for video virtual try-on. InProceedings of the IEEE/CVF international conference on computer vision, pages 1161–1170, 2019. 2, 3

work page 2019
[15]

Vivid: Video virtual try-on using diffusion models.arXiv preprint arXiv:2405.11794, 2024

Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, and Zheng- Jun Zha. Vivid: Video virtual try-on using diffusion models. arXiv preprint arXiv:2405.11794, 2024. 2, 3, 6, 7

work page arXiv 2024
[16]

Taming the power of diffusion models for high-quality virtual try-on with appearance flow

Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang. Taming the power of diffusion models for high-quality virtual try-on with appearance flow. InProceedings of the 31st ACM International Conference on Multimedia, pages 7599–7607, 2023. 3

work page 2023
[17]

Densepose: Dense human pose estimation in the wild

Rıza Alp G ¨uler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7297–7306, 2018. 4

work page 2018
[18]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Viton: An image-based virtual try-on network

Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7543–7552, 2018. 3

work page 2018
[20]

Can spatiotemporal 3d cnns retrace the history of 2d cnns and im- agenet? InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018

Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and im- agenet? InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018. 6

work page 2018
[21]

Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 3

work page 2020
[22]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 6, 7

work page 2022
[23]

Vbench: Comprehensive bench- mark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 2, 5, 6, 7

work page 2024
[24]

Doubleu-net: A deep convolutional neural network for medical image segmen- tation

Debesh Jha, Michael A Riegler, Dag Johansen, P ˚al Halvorsen, and H ˚avard D Johansen. Doubleu-net: A deep convolutional neural network for medical image segmen- tation. In2020 IEEE 33rd International symposium on computer-based medical systems (CBMS), pages 558–564. IEEE, 2020. 3

work page 2020
[25]

Cloth- former: Taming video virtual try-on in all module

Jianbin Jiang, Tan Wang, He Yan, and Junhui Liu. Cloth- former: Taming video virtual try-on in all module. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10799–10808, 2022. 2

work page 2022
[26]

VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing.arXiv preprint arXiv:2503.07598, 2025. 2, 3, 4, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Stableviton: Learning semantic correspon- dence with latent diffusion model for virtual try-on

Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learning semantic correspon- dence with latent diffusion model for virtual try-on. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8176–8185, 2024. 3

work page 2024
[28]

Flux-text: A simple and advanced diffusion transformer baseline for scene text editing.arXiv preprint arXiv:2505.03329, 2025

Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Lei Sun, and Xiangxiang Chu. Flux-text: A simple and advanced diffusion transformer baseline for scene text editing.arXiv preprint arXiv:2505.03329, 2025. 3

work page arXiv 2025
[29]

Pursuing temporal-consistent video virtual try-on via dynamic pose in- teraction

Dong Li, Wenqi Zhong, Wei Yu, Yingwei Pan, Dingwen Zhang, Ting Yao, Junwei Han, and Tao Mei. Pursuing temporal-consistent video virtual try-on via dynamic pose in- teraction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22648–22657, 2025. 2

work page 2025
[30]

Magictryon: Harnessing diffusion transformer for garment-preserving video virtual try-on, 2025

Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, and Peng-Tao Jiang. Magictryon: Harnessing diffusion transformer for garment-preserving video virtual try-on, 2025. 2, 3, 4, 6, 7

work page 2025
[31]

Realvvt: Towards photorealistic video virtual try-on via spatio-temporal consistency.arXiv preprint arXiv:2501.08682, 2025

Siqi Li, Zhengkai Jiang, Jiawei Zhou, Zhihong Liu, Xiaowei Chi, and Haoqian Wang. Realvvt: Towards photorealistic video virtual try-on via spatio-temporal consistency.arXiv preprint arXiv:2501.08682, 2025. 2, 3

work page arXiv 2025
[32]

Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want

Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271, 2024. 3

work page arXiv 2024
[33]

Perceive anything: Recognize, explain, caption, and segment anything in images and videos, 2025

Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos, 2025. 3

work page 2025
[34]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 1096–1104, 2016. 3

work page 2016
[37]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

work page internal anchor Pith review Pith/arXiv arXiv 2017
[38]

Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, and Shanghang Zhang. Llm as dataset analyst: Subpopulation structure discovery with large language model. In European Conference on Computer Vision, pages 235–252. Springer, 2024.
[39]

Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress code: High-resolution multi-category virtual try-on. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2231–2235, 2022.
[40]

Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. In Proceedings of the 31st ACM international conference on multimedia, pages 8580–8589, 2023.
[41]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[42]

Zhenglin Pan. Anilines - anime lineart extractor. https://github.com/zhenglinpan/AniLines-Anime-Lineart-Extractor, 2025.
[43]

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,
[44]

Long Peng, Wenbo Li, Renjing Pei, Jingjing Ren, Jiaqi Xu, Yang Wang, Yang Cao, and Zheng-Jun Zha. Towards realistic data generation for real-world super-resolution. arXiv preprint arXiv:2406.07255, 2024.
[45]

Long Peng, Anran Wu, Wenbo Li, Peizhe Xia, Xueyuan Dai, Xinjie Zhang, Xin Di, Haoze Sun, Renjing Pei, Yang Wang, et al. Pixel to gaussian: Ultra-fast continuous super-resolution with 2d gaussian modeling. arXiv preprint arXiv:2503.06617, 2025.
[46]

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos,
[47]

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks,
[48]

Dan Song, Jian-Hao Zeng, Min Liu, Xuan-Ya Li, and An-An Liu. Fashion customization: Image generation based on editing clue. IEEE Transactions on Circuits and Systems for Video Technology, 34(6):4434–4444, 2023.
[49]

Dan Song, Juan Zhou, Jianhao Zeng, HongShuo Tian, Bolun Zheng, Rongbao Kang, and An-An Liu. Mef-gd: Multimodal enhancement and fusion network for garment designer. IEEE Transactions on Circuits and Systems for Video Technology, 2025.
[50]

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[51]

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
[52]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[53]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

[54]

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[55]

Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[56]

Zhenyu Xie, Zaiyu Huang, Xin Dong, Fuwei Zhao, Haoye Dong, Xijin Zhang, Feida Zhu, and Xiaodan Liang. Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23550–23559, 2023.
[57]

Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, and Xiangxiang Chu. Scalar: Scale-wise controllable visual autoregressive learning. arXiv preprint arXiv:2507.19946, 2025.
[58]

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
[59]

Jianhao Zeng, Dan Song, Weizhi Nie, Hongshuo Tian, Tongtong Wang, and An-An Liu. Cat-dm: Controllable accelerated virtual try-on with diffusion model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8372–8382, 2024.
[60]

Nannan Zhang, Yijiang Li, Dong Du, Zheng Chong, Zhengwentai Sun, Jianhao Zeng, Yusheng Dai, Zhengyu Xie, Hairui Zhu, and Xiaoguang Han. Robust-mvton: Learning cross-pose feature alignment and fusion for robust multi-view virtual try-on. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16029–16039,
[61]

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[62]

Jun Zheng, Fuwei Zhao, Youjiang Xu, Xin Dong, and Xiaodan Liang. Viton-dit: Learning in-the-wild video try-on from human dance videos via diffusion transformers. arXiv preprint arXiv:2405.18326, 2024.
[63]

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.
[64]

Shangchen Zhou, Chongyi Li, Kelvin CK Chan, and Chen Change Loy. Propainter: Improving propagation and transformer for video inpainting. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10477–10486, 2023.
[65]

Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang, Mingyuan Gao, and Xin Dong. Dreamvvt: Mastering realistic video virtual try-on in the wild via a stage-wise diffusion transformer framework. arXiv preprint arXiv:2508.02807,