Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
Pith reviewed 2026-05-17 06:30 UTC · model grok-4.3
The pith
A new dataset with high-resolution close-up garment images and real-model videos lets existing video models generate virtual try-on results that preserve fabric textures and details far better.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that existing video generation models, when given the detailed images from the introduced dataset, can extract and incorporate texture features, which significantly enhances the realism and detail fidelity of virtual try-on results. The dataset supplies high-fidelity close-up garment images together with textual descriptions and pairs them with both full-shot and close-up try-on videos captured on real human models. The new VGID metric is defined to quantify preservation of both texture and structure in these videos, and benchmarking shows that prior methods still lose fine garment details, especially in close-up footage.
What carries the argument
The Eevee dataset of paired high-resolution close-up garment images, text descriptions, and real-model full-shot plus close-up videos, together with the VGID metric that scores joint texture and structural consistency.
If this is right
- Video models conditioned on the dataset's close-up garment images produce virtual try-on footage whose fabric details remain sharper and more consistent across frames.
- The VGID metric ranks current methods by how well they keep both fine texture and overall garment shape intact in close-up sequences.
- Business video production gains a practical route to high-fidelity close-up marketing clips without additional live shoots.
- Benchmark results highlight specific failure modes in texture transfer that future architectures must address.
Where Pith is reading between the lines
- Retail platforms could integrate the dataset into automated video pipelines to reduce the cost of creating detailed product demonstrations.
- The same close-up conditioning strategy might transfer to other video synthesis domains that require preserving fine surface details, such as product repair or material simulation clips.
- Extending the dataset to varied body types, lighting conditions, and garment categories would test how far the texture-extraction benefit generalizes.
Load-bearing premise
The collected close-up videos of real models are representative enough of everyday fashion marketing needs, and the VGID score actually tracks the texture and structure qualities that matter to viewers.
What would settle it
Train several recent video models on the new dataset and compare their close-up output videos with those of the same models trained on prior single-image datasets. If human raters or independent texture-matching measures find no improvement in texture detail, the central claim fails.
Original abstract
Video virtual try-on technology provides a cost-effective solution for creating marketing videos in fashion e-commerce. However, its practical adoption is hindered by two critical limitations. First, the reliance on a single garment image as input in current virtual try-on datasets limits the accurate capture of realistic texture details. Second, most existing methods focus solely on generating full-shot virtual try-on videos, neglecting the business's demand for videos that also provide detailed close-ups. To address these challenges, we introduce a high-resolution dataset for video-based virtual try-on. This dataset offers two key features. First, it provides more detailed information on the garments, which includes high-fidelity images with detailed close-ups and textual descriptions; Second, it uniquely includes full-shot and close-up try-on videos of real human models. Furthermore, accurately assessing consistency becomes significantly more critical for the close-up videos, which demand high-fidelity preservation of garment details. To facilitate such fine-grained evaluation, we propose a new garment consistency metric VGID (Video Garment Inception Distance) that quantifies the preservation of both texture and structure. Our experiments validate these contributions. We demonstrate that by utilizing the detailed images from our dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. Furthermore, we conduct a comprehensive benchmark of recent models. The benchmark effectively identifies the texture and structural preservation problems among current methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a high-resolution dataset for video-based virtual try-on featuring detailed garment images (including close-ups and textual descriptions) and paired full-shot/close-up try-on videos of real models. It proposes the VGID metric to quantify texture and structure preservation in close-up videos and claims that existing video generation models can extract better texture features from the detailed images, yielding more realistic results; a benchmark of recent models is also presented to highlight current limitations in texture and structural fidelity.
Significance. If the central claims hold, the work addresses a practical gap in fashion e-commerce by shifting focus from full-shot to close-up virtual try-on and supplying both richer input data and a specialized consistency metric. The dataset's dual provision of detailed garment captures and corresponding real-model videos is a clear strength that could enable more faithful texture transfer; the emphasis on a business-relevant evaluation setting is also positive. Significance is tempered by the need for stronger evidence that VGID reliably isolates the claimed texture gains.
major comments (2)
- [VGID definition and experiments] VGID metric (introduced to support the fine-grained evaluation claim): no human correlation study, no comparison against region-masked LPIPS or DINO features, and no ablation isolating texture fidelity from lighting/pose confounds are reported. Without these, lower VGID scores cannot be confidently attributed to the realism gains from the dataset's detailed close-up images, directly weakening the central experimental claim.
- [Experiments and benchmark] Experiments section and benchmark results: the abstract states that detailed images 'significantly enhance' realism and that the benchmark 'identifies' problems, yet no quantitative numbers, error bars, or per-model VGID breakdowns are visible in the provided text. This absence makes it impossible to verify whether the enhancement is load-bearing or merely incremental.
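The human correlation study requested in the first major comment could be sketched as follows. This is a minimal illustration in plain Python, not the paper's protocol: it assumes one metric score and one mean human rating per generated video, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `pearson(metric_scores, human_ratings)` and `spearman(metric_scores, human_ratings)` on the same subset of videos would give the two coefficients the referee asks for. Spearman is the safer choice if the relation between metric and rating is monotonic but not linear.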
minor comments (2)
- [Metric definition] Notation for VGID components (texture vs. structure terms) should be defined explicitly with equations or pseudocode to avoid ambiguity when readers attempt to reproduce the metric.
- [Dataset description] Dataset statistics table (if present) would benefit from explicit comparison to prior virtual try-on datasets on resolution, close-up coverage, and garment variety to clarify the claimed novelty.
Simulated Author's Rebuttal
Thank you for your detailed and constructive review. We appreciate the feedback highlighting areas where the manuscript can be strengthened, particularly around the validation of the VGID metric and the presentation of experimental results. We address each major comment below and commit to revisions that will improve clarity and rigor without altering the core contributions.
- Referee: [VGID definition and experiments] VGID metric (introduced to support the fine-grained evaluation claim): no human correlation study, no comparison against region-masked LPIPS or DINO features, and no ablation isolating texture fidelity from lighting/pose confounds are reported. Without these, lower VGID scores cannot be confidently attributed to the realism gains from the dataset's detailed close-up images, directly weakening the central experimental claim.
  Authors: We agree that further validation of VGID would strengthen the central claim. In the revised manuscript we will add a human correlation study in which participants rate texture and structure preservation on a subset of generated videos, and we will report Pearson and Spearman correlations with VGID. We will also include direct comparisons of VGID against region-masked LPIPS and DINO features on the same evaluation set. Finally, we will perform and report controlled ablations that fix lighting and pose while varying only garment texture detail, thereby isolating the contribution of the close-up garment images. These additions will be placed in a new subsection of the experiments. Revision: yes.
- Referee: [Experiments and benchmark] Experiments section and benchmark results: the abstract states that detailed images 'significantly enhance' realism and that the benchmark 'identifies' problems, yet no quantitative numbers, error bars, or per-model VGID breakdowns are visible in the provided text. This absence makes it impossible to verify whether the enhancement is load-bearing or merely incremental.
  Authors: We apologize that the quantitative evidence was not sufficiently explicit in the narrative. The manuscript already contains VGID scores, per-model comparisons, and benchmark tables in Section 4 and the associated figures. To make these results immediately verifiable, we will expand the experiments section with explicit numerical values, standard-error bars, and a consolidated table that reports VGID (texture and structure components) for every baseline on both full-shot and close-up videos. We will also add a short paragraph quantifying the improvement magnitude (e.g., relative VGID reduction) when detailed garment images are used versus single-image baselines. Revision: yes.
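The standard-error bars and relative reduction mentioned in the second response could be computed along these lines. This is a minimal sketch under stated assumptions: the helper names are hypothetical, each model is assumed to yield one scalar score per test video, and the direction of "reduction" (lower is better) follows the wording of the response rather than anything the source confirms.

```python
import math


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for per-video scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = sum(scores) / n
    # Unbiased sample variance (divide by n - 1).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


def relative_reduction(baseline_mean, improved_mean):
    """Fractional drop from the baseline mean to the improved mean."""
    if baseline_mean == 0:
        raise ValueError("relative reduction is undefined for a zero baseline")
    return (baseline_mean - improved_mean) / baseline_mean
```

Reporting the mean with its standard error for each baseline, next to the relative change between single-image and detailed-image conditioning, would let a reader judge whether the gap exceeds the sampling noise.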
Circularity Check
No circularity: new dataset and metric introduced as independent contributions
The paper's core contributions are the creation of a new high-resolution video try-on dataset containing detailed garment close-ups and paired full/close-up videos, plus the definition of a new VGID metric for evaluating texture and structure preservation. These are presented as empirical resources rather than derived quantities. The claim that detailed images improve texture extraction in existing models is supported by experiments on the new data, not by fitting parameters to a target result or reducing via self-referential equations. No load-bearing step equates a prediction to its own input by construction, and no uniqueness theorem or ansatz is smuggled through self-citation. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Inception-style features can quantify garment texture and structure preservation in video frames.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose a new garment consistency metric VGID (Video Garment Inception Distance) that quantifies the preservation of both texture and structure... VGID(I_s, I_v) = GAP(F′_s)·GAP(F′_v) / (∥GAP(F′_s)∥ ∥GAP(F′_v)∥)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce Eevee, a new high-resolution dataset... first dataset to provide both full-shot and close-up videos, and corresponding detailed close-up images"
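The VGID expression quoted in the first entry above has the form of a cosine similarity between globally average-pooled (GAP) feature maps. The sketch below illustrates that reading only. It assumes the feature maps F′ for the garment image and the video frame are already available as channel x height x width nested lists; how the paper extracts or masks them, and how it aggregates over frames, is not stated in the quoted passage. All function names are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """Average each channel of a C x H x W nested list over its H x W grid."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def vgid_like_score(garment_features, frame_features):
    """Cosine similarity of the pooled garment and frame features.

    Assumption: this mirrors only the quoted formula, not the full metric.
    """
    return cosine_similarity(
        global_average_pool(garment_features),
        global_average_pool(frame_features),
    )
```

Read this way, the score lies in [-1, 1] and larger values mean closer agreement between the pooled features. Because pooling discards spatial layout, a sketch like this cannot by itself show how the metric captures structure.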
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- GS-STVSR: Ultra-Efficient Continuous Spatio-Temporal Video Super-Resolution via 2D Gaussian Splatting
  GS-STVSR achieves state-of-the-art continuous spatio-temporal video super-resolution quality with nearly constant inference time at standard scales and over 3x speedup at extreme scales using 2D Gaussian Splatting.