arxiv: 2501.11325 · v1 · pith:OXFRQNP4new · submitted 2025-01-20 · 💻 cs.CV · cs.AI

CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation

classification 💻 cs.CV cs.AI
keywords try-on video catv2ton image virtual across tasks temporal

Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TryOnCrafter: Unleashing Camera Trajectories for Realistic Video Virtual Try-on via a Renderable 4D Try-on Proxy

    cs.CV 2026-06 unverdicted novelty 7.0

    TryOnCrafter is the first DiT-based framework for camera-controllable video virtual try-on via a renderable 4D try-on proxy distilled from 2D priors into a 3DGS avatar animated with SMPL-X.

  2. OmniTryOn: Video Try-On Anything at Once!

    cs.CV 2026-06 unverdicted novelty 7.0

    OmniTryOn performs multi-object video virtual try-on in one pass using first-frame wearable caching and spatiotemporal RoPE, outperforming single-garment baselines on a new TryAny-Bench dataset.

  3. TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On

    cs.CV 2026-04 unverdicted novelty 7.0

    A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.

  4. The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection

    cs.CV 2025-12 unverdicted novelty 6.0

    KeyTailor improves video virtual try-on realism by using instruction-guided keyframes to enhance garment details and background integrity in DiT models without major architectural changes.

  5. Eevee: Towards Close-up High-resolution Video-based Virtual Try-on

    cs.CV 2025-11 unverdicted novelty 6.0

    A new dataset with high-fidelity close-up garment images and full/close-up try-on videos plus the VGID metric enables better texture and structure preservation in high-resolution video virtual try-on.

  6. RefTon: Reference person shot assist virtual Try-on

    cs.CV 2025-11 unverdicted novelty 6.0

    RefTon is a flux-based virtual try-on method that uses unpaired reference images of the target garment on different people to guide texture and detail preservation in a streamlined person-to-person pipeline without bo...