CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation

Dongmei Jiang; Haoxiang Li; Jun Zheng; Shiyue Zhang; Wenqing Zhang; Xiaodan Liang; Xiao Dong; Yiling Wu; Zheng Chong

arxiv: 2501.11325 · v1 · pith:OXFRQNP4new · submitted 2025-01-20 · 💻 cs.CV · cs.AI

keywords: try-on, video, catv2ton, image, virtual, across, tasks, temporal

Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.
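
The overlapping clip-based inference mentioned in the abstract can be pictured with a small windowing sketch. This is only an illustration of how a long video might be split into overlapping frame ranges. The function name `overlapping_clips` and the parameters `clip_len` and `overlap` are hypothetical, and the abstract does not state the clip length, the overlap size, or how sequential frame guidance and AdaCN use the shared frames.

```python
def overlapping_clips(num_frames, clip_len, overlap):
    """Split frame indices [0, num_frames) into overlapping (start, end) ranges.

    Hypothetical helper for illustration only; it is not taken from the paper.
    Consecutive clips share `overlap` frames, and the last clip may be shorter
    than `clip_len` when the video length is not an exact fit.
    """
    if clip_len <= 0 or not 0 <= overlap < clip_len:
        raise ValueError("need clip_len > 0 and 0 <= overlap < clip_len")
    if num_frames <= 0:
        return []

    stride = clip_len - overlap  # number of new frames each clip contributes
    clips = []
    start = 0
    while True:
        end = min(start + clip_len, num_frames)
        clips.append((start, end))
        if end == num_frames:
            break
        start += stride
    return clips


if __name__ == "__main__":
    # A 10-frame video, 4-frame clips, 1 shared frame between neighbours.
    for start, end in overlapping_clips(10, 4, 1):
        print(f"clip covers frames {start}..{end - 1}")
```

In a setup like this, the frames shared between neighbouring clips are the ones a later clip could take as guidance from the clip before it. How CatV2TON actually does this is described in the paper itself, not on this page.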

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TryOnCrafter: Unleashing Camera Trajectories for Realistic Video Virtual Try-on via a Renderable 4D Try-on Proxy
cs.CV 2026-06 unverdicted novelty 7.0

TryOnCrafter is the first DiT-based framework for camera-controllable video virtual try-on, using a renderable 4D try-on proxy distilled from 2D priors into a 3DGS avatar animated with SMPL-X.
OmniTryOn: Video Try-On Anything at Once!
cs.CV 2026-06 unverdicted novelty 7.0

OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.
TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On
cs.CV 2026-04 unverdicted novelty 7.0

A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.
The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
cs.CV 2025-12 unverdicted novelty 6.0

KeyTailor improves video virtual try-on realism by using instruction-guided keyframes to enhance garment details and background integrity in DiT models without major architectural changes.
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
cs.CV 2025-11 unverdicted novelty 6.0

A new dataset with high-fidelity close-up garment images and full/close-up try-on videos plus the VGID metric enables better texture and structure preservation in high-resolution video virtual try-on.
RefTon: Reference person shot assist virtual Try-on
cs.CV 2025-11 unverdicted novelty 6.0

RefTon is a flux-based virtual try-on method that uses unpaired reference images of the target garment on different people to guide texture and detail preservation in a streamlined person-to-person pipeline without bo...