iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Bo Zheng; Jing Wang; Jinsong Lan; Jun Zheng; Kaifu Zhang; Mengting Chen; Xiaodan Liang; Xiaoyong Zhu; Zhengze Xu

arxiv: 2605.21431 · v1 · pith:QID2FV3Bnew · submitted 2026-05-20 · 💻 cs.CV

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

Pith reviewed 2026-05-21 05:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords Video Virtual Try-On · Interactive Video Generation · Diffusion Models · Human-Garment Interaction · Spatial Guidance · Semantic Guidance · Virtual Try-On · Action Captions

The pith

A framework called iTryOn uses 3D hand guidance and timed action descriptions to enable realistic virtual garment changes in videos where people interact with their clothes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Interactive Video Virtual Try-On, a task where video subjects actively engage with clothing rather than just posing. Existing methods handle static showcases but struggle with dynamic interactions like pulling or adjusting garments. iTryOn addresses this by injecting spatial guidance from hand positions and semantic guidance from action captions into a video diffusion model. If successful, this would allow more natural and controllable try-on experiences for e-commerce and fashion applications. The approach combines a garment-agnostic 3D hand prior with synchronized position embeddings to handle sparse interaction moments in training data.

Core claim

iTryOn pioneers a multi-level interaction injection mechanism within a large-scale video diffusion Transformer to guide the generation of complex garment dynamics in interactive scenarios. At the spatial level, a garment-agnostic 3D hand prior provides fine-grained guidance for precise hand-garment contact. At the semantic level, global captions and time-stamped action captions are synchronized via Action-aware Rotational Position Embedding (A-RoPE) to resolve ambiguities and learn deformations from brief interactive moments.
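To make the spatial-level idea concrete, here is a minimal sketch of what a garment-agnostic hand prior could look like. It is not the authors' implementation. It assumes 3D hand joints in camera coordinates are already available from some hand estimator, projects them with a simple pinhole camera, and splats their depth into an otherwise empty map. The function name `render_hand_depth`, the centered principal point, and the splat `radius` are hypothetical choices made for illustration. The point it shows is that the map is built from hand geometry only, so no pixels of the source garment can leak into it.

```python
import numpy as np

def render_hand_depth(joints_xyz, focal, height, width, radius=2):
    """Splat 3D hand joints (camera coordinates, z > 0) into a sparse depth map.

    Hypothetical helper for illustration: only hand geometry is used as input,
    so the result carries no information about the source garment.
    """
    depth = np.zeros((height, width), dtype=np.float32)  # 0 means "no hand here"
    cx, cy = width / 2.0, height / 2.0  # assumed principal point at image center
    for x, y, z in joints_xyz:
        if z <= 0:
            continue  # joint behind the camera, skip it
        # Pinhole projection to pixel coordinates.
        u = int(round(focal * x / z + cx))
        v = int(round(focal * y / z + cy))
        v0, v1 = max(v - radius, 0), min(v + radius + 1, height)
        u0, u1 = max(u - radius, 0), min(u + radius + 1, width)
        if v0 >= v1 or u0 >= u1:
            continue  # splat falls entirely outside the image
        patch = depth[v0:v1, u0:u1]  # view into depth, writes go through
        # Where splats overlap, keep the surface nearest to the camera.
        closer = (patch == 0) | (z < patch)
        patch[closer] = z
    return depth

# Example input: two joints at different depths.
joints = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 2.0]])
hand_map = render_hand_depth(joints, focal=100.0, height=64, width=64)
```

A real system would more likely rasterize a full hand mesh than splat joints; the joint-based version is used here only because it keeps the example short.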

What carries the argument

The multi-level interaction injection mechanism, which combines a garment-agnostic 3D hand prior for spatial guidance and Action-aware Rotational Position Embedding (A-RoPE) for synchronizing time-stamped action captions.
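The exact A-RoPE formulation is not given on this page, so the following is only a sketch of the general idea of tying a time-stamped caption to a video segment through rotary positions. It uses standard 1D RoPE and a hypothetical rule that gives every token of an action caption the center frame index of its segment. `rope_1d`, `action_token_positions`, and the segment-center rule are assumptions for illustration, not the paper's method.

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Apply standard 1D rotary position embedding to x of shape (n, d), d even."""
    n, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)       # (d/2,) rotation frequencies
    angles = positions[:, None] * inv_freq[None, :]     # (n, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=np.float64)
    # Rotate each (even, odd) feature pair by its position-dependent angle.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def action_token_positions(segments, tokens_per_caption):
    """Hypothetical rule: each caption token gets the center frame index of its segment."""
    pos = []
    for (start, end), n_tok in zip(segments, tokens_per_caption):
        pos.extend([(start + end) / 2.0] * n_tok)
    return np.asarray(pos, dtype=np.float64)

# Two action captions covering frames 8-16 and 40-48, with 3 and 2 tokens.
segments = [(8, 16), (40, 48)]
caption_pos = action_token_positions(segments, [3, 2])
```

Because the dot product of two RoPE-rotated vectors depends only on the difference of their positions, caption tokens placed at a segment's temporal position sit at zero or small relative offset to the frames of that segment, which is one plausible way for attention to associate a caption with its time span.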

If this is right

iTryOn achieves state-of-the-art performance on traditional non-interactive VVT benchmarks.
iTryOn establishes a commanding lead in the new interactive VVT setting.
The framework enables more dynamic and controllable virtual try-on experiences.
iTryOn resolves the semantic ambiguity of interactions and learns complex garment deformations from videos in which interactive moments are sparse and brief.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending this to real-time applications could allow live virtual try-on during video calls or social media.
Similar spatial-semantic guidance might apply to other video editing tasks involving human-object interactions, such as object manipulation in augmented reality.
Future work could test if the hand prior generalizes to different garment types without additional training.
Integration with user-controlled inputs might enable personalized interaction styles in try-on videos.

Load-bearing premise

That combining a garment-agnostic 3D hand prior with time-stamped action captions can sufficiently resolve semantic ambiguity and complex deformations even when interactive moments are sparse and brief in training videos.

What would settle it

A test video sequence with complex hand-garment interactions that are not captured well by 3D hand priors or action captions, where the generated output shows incorrect deformations or misplaced contacts compared to ground truth.

Figures

Figures reproduced from arXiv: 2605.21431 by Bo Zheng, Jing Wang, Jinsong Lan, Jun Zheng, Kaifu Zhang, Mengting Chen, Xiaodan Liang, Xiaoyong Zhu, Zhengze Xu.

**Figure 1.** iTryOn synthesizes a diverse range of complex human-garment interactions guided by action captions. The examples showcase the model’s ability to generate physically plausible deformations for various actions. (Best viewed in motion in the supplementary videos.) […] adapted powerful pre-trained diffusion models by incorporating temporal modules. These approaches leverage the strong priors learned from large-sca…

**Figure 2.** The iTryOn architecture. (a) A DiT backbone with parallel injection of general context and 3D-hand guidance from our Interaction Guider. An action-aware constraint loss focuses training on interaction frames. (b) The Interaction Guider module fuses spatial features with global and action-specific text prompts. (c) Our A-RoPE mechanism aligns action captions to their corresponding video segments via unique …

**Figure 3.** Visual justification for our garment-agnostic 3D hand prior. Deriving a "Hand Depth" prior from human parsing (Li et al., 2022) and video depth (Chen et al., 2025) suffers from critical information leakage. This flawed prior improperly retains source garment geometry, such as the sleeve cuff, leading directly to visible artifacts in the generated output. In contrast, our fully garment-agnostic 3D hand prio…

**Figure 4.** Qualitative comparison on the VVT-Interact dataset. […] 4.2. Implementation Details. Our model is initialized from the pre-trained Wan2.1-VACE (Jiang et al., 2025) and trained using a two-stage scheme. In the first stage, we finetune the model on the ViViD dataset for 10k steps using empty action captions (i.e., treating all samples as non-interactive). After this stage, we evaluate the model on the ViViD-S-Tes…

**Figure 5.** Visual comparison of different variants on the VVT-Interact dataset. […] implausible deformations (e.g., ViViD) or completely misinterpret the action, producing a simple hand-gliding motion without engaging the garment (e.g., CatV2TON, MagicTryOn). Similarly, for a hem-pulling action, they often render a static unresponsive garment. In contrast, iTryOn is the only method that successfully synthesizes these i…

**Figure 6.** Visual motivation for our Action-aware Semantic Guidance. These examples from our VVT-Interact dataset highlight the semantic ambiguity of VLM-generated global captions. Although the ground-truth interactions are distinct (rolling sleeves vs. adjusting the hem), both are imprecisely described with the generic verb "adjusts". Our categorical action captions resolve this ambiguity, providing the model with a…

**Figure 7.** Qualitative comparison on the ViViD dataset.
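The Figure 2 caption mentions an action-aware constraint loss that focuses training on interaction frames, but this page does not give its form. One simple way such a loss could work is sketched below: a per-frame mean squared error in which frames inside any time-stamped action segment are up-weighted. The function name, the `boost` factor, and the use of plain MSE are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def action_aware_loss(pred, target, segments, boost=2.0):
    """Per-frame MSE with frames inside any action segment up-weighted by `boost`.

    pred, target: arrays of shape (num_frames, ...).
    segments: list of (start, end) frame ranges, start inclusive, end exclusive.
    Hypothetical sketch; the weighting scheme is an assumption.
    """
    num_frames = pred.shape[0]
    weights = np.ones(num_frames)
    for start, end in segments:
        weights[start:end] = boost  # emphasize the brief interactive moments
    # Mean squared error of each frame over all of its remaining dimensions.
    per_frame = ((pred - target) ** 2).reshape(num_frames, -1).mean(axis=1)
    # Normalize by the total weight so the scale stays comparable to plain MSE.
    return float((weights * per_frame).sum() / weights.sum())

# Four frames of 2x2 "video"; only frame 1 has an error.
pred = np.zeros((4, 2, 2))
target = np.zeros((4, 2, 2))
target[1] = 1.0
loss_plain = action_aware_loss(pred, target, segments=[])
loss_boosted = action_aware_loss(pred, target, segments=[(1, 2)], boost=3.0)
```

With sparse interactions, most frames contribute nothing distinctive about garment deformation, so a weighting of this kind is one way to keep the few interactive frames from being averaged away.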

Original abstract

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

iTryOn carves out interactive video virtual try-on as a distinct task and adds a garment-agnostic 3D hand prior plus A-RoPE, but the evidence that these fix sparse-interaction deformations is still thin.

iTryOn is mainly interesting because it names and formalizes Interactive Video Virtual Try-On, where the person in the video actually grabs, pulls, or adjusts the garment instead of just modeling it. The authors build on a video diffusion transformer and inject guidance at two levels: a garment-agnostic 3D hand prior for spatial contact points, and time-stamped action captions fed through their Action-aware Rotational Position Embedding (A-RoPE) for semantic timing. That combination is a direct response to the two problems they flag: semantic ambiguity from pose alone, and learning deformations when real interaction frames are rare and short. The abstract reports SOTA on standard VVT benchmarks plus a clear lead on the new interactive setting, which is the concrete payoff they advertise.

The task definition and the two injection mechanisms are the parts that feel genuinely new relative to the non-interactive VVT papers they cite. The multi-level split (spatial prior plus semantic timing) is a clean way to organize the guidance, and it is easy to see why someone working on controllable video generation would pick it up.

The soft spot is the one the stress-test note raises. Because the hand prior is deliberately garment-agnostic, it supplies pose and contact geometry without material or topology information. Nothing in the abstract describes an extra garment-conditioned deformation field or per-contact loss that would teach the model how a specific shirt actually folds when a hand tugs it. With interactions described as sparse and brief, it is not obvious the diffusion transformer will pick up those garment-specific dynamics from the prior alone. The reported gains could therefore be driven more by the base model or the action captions than by the new spatial-semantic injection. Without the full experiments, ablations, or error analysis, it is hard to judge how much the proposed components actually contribute.

This paper is for people already working on video virtual try-on or controllable video diffusion. A reader who needs a new benchmark or a practical way to add hand-garment interaction will get value from the task setup and the injection design. It is worth sending to a serious referee because the task is well-motivated and the mechanisms are explicit, even if the current evidence for the central claim needs more scrutiny on the deformation side.

Referee Report

1 major / 2 minor

Summary. The manuscript formalizes the new task of Interactive Video Virtual Try-On (Interactive VVT), where subjects actively interact with garments in video, and proposes iTryOn, a video diffusion Transformer framework. It introduces multi-level interaction injection: a garment-agnostic 3D hand prior at the spatial level for hand-garment contact and time-stamped action captions synchronized via Action-aware Rotational Position Embedding (A-RoPE) at the semantic level. The work claims SOTA performance on standard VVT benchmarks and a commanding lead on the interactive setting.

Significance. If the results hold, the formalization of Interactive VVT and the spatial-semantic guidance mechanism represent a meaningful advance toward controllable, dynamic virtual try-on for real-world apparel scenarios. The integration of 3D hand priors with action-aware embeddings in a diffusion Transformer is a targeted contribution to handling sparse interactions and deformations in video generation.

major comments (1)

[Method description and abstract] The central claim of a commanding lead on Interactive VVT (abstract) rests on the multi-level injection resolving semantic ambiguity and complex deformations. The garment-agnostic 3D hand prior supplies pose/contact geometry independent of garment material or topology, yet the manuscript provides no explicit mechanism (e.g., garment-conditioned deformation field or per-frame contact loss) to enable learning of garment-specific folding/stretching at contact points during sparse, brief interactions. This leaves open whether reported gains derive from the base diffusion Transformer rather than the proposed guidance.

minor comments (2)

[Abstract] The abstract and introduction would benefit from explicit dataset names, evaluation metrics, and a brief summary of ablation studies to support the SOTA and commanding-lead claims.
[Method] Notation for A-RoPE and the precise injection points into the diffusion Transformer could be clarified with a diagram or pseudocode for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comment raises an important point about the mechanisms underlying garment-specific deformations, which we address below with clarifications from our experiments and revisions to the text.

Referee: The central claim of a commanding lead on Interactive VVT (abstract) rests on the multi-level injection resolving semantic ambiguity and complex deformations. The garment-agnostic 3D hand prior supplies pose/contact geometry independent of garment material or topology, yet the manuscript provides no explicit mechanism (e.g., garment-conditioned deformation field or per-frame contact loss) to enable learning of garment-specific folding/stretching at contact points during sparse, brief interactions. This leaves open whether reported gains derive from the base diffusion Transformer rather than the proposed guidance.

Authors: We thank the referee for this observation. The 3D hand prior is deliberately garment-agnostic to supply robust, topology-independent contact geometry that generalizes across clothing types. Garment-specific folding and stretching are learned implicitly because the diffusion Transformer is conditioned on the target garment image at every step; the spatial hand guidance localizes where deformations must occur, while the time-stamped action captions (via A-RoPE) disambiguate the interaction semantics that dictate deformation style. Ablation experiments (Section 4.3 and supplementary material) show that removing the hand prior or A-RoPE produces statistically significant drops in contact accuracy and perceptual deformation quality on interactive sequences, indicating that the reported gains are not attributable to the base model alone. We have revised the method section to explicitly describe this conditioning pathway and added a paragraph clarifying the role of the diffusion process in learning deformations from the provided guidance. We agree that an explicit garment-conditioned deformation field or auxiliary contact loss would constitute a valuable extension and note this as future work.

revision: partial

Circularity Check

0 steps flagged

No circularity in iTryOn derivation chain

The paper formalizes a new Interactive VVT task and presents iTryOn as an independent framework built on a video diffusion Transformer, injecting a garment-agnostic 3D hand prior at the spatial level and time-stamped action captions via A-RoPE at the semantic level. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or description; the multi-level guidance is introduced as a novel mechanism rather than derived from or reducing to prior results by construction. Empirical SOTA claims rest on experiments, not tautological re-derivation of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only, the central claim rests on standard assumptions of diffusion models for video generation plus two new guidance mechanisms whose effectiveness is asserted but not detailed. No explicit free parameters, axioms, or invented entities are listed in the provided text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost.lean · Jcost_pos_of_ne_one
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: multi-level interaction injection mechanism... garment-agnostic 3D hand prior... Action-aware Rotational Position Embedding (A-RoPE)... action-aware constraint loss

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: A-RoPE... scaled 1D-RoPE... k=4 separation scale

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page arXiv
[65]

International Conference on Learning Representations , year=

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning , author=. International Conference on Learning Representations , year=

work page
[66]

arXiv preprint arXiv:2311.16933 , year=

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models , author=. arXiv preprint arXiv:2311.16933 , year=

work page arXiv
[67]

Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers.arXiv preprint arXiv:2405.05945, 2024

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers , author=. arXiv preprint arXiv:2405.05945 , year=

work page arXiv
[68]

Latte: Latent Diffusion Transformer for Video Generation

Latte: Latent Diffusion Transformer for Video Generation , author=. arXiv preprint arXiv:2401.03048 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[69]

, title =

PKU-Yuan Lab and Tuzhan AI etc. , title =. doi:10.5281/zenodo.10948109 , url =

work page doi:10.5281/zenodo.10948109
[70]

Proceedings of the European Conference on Computer Vision (ECCV) , pages=

Toward Characteristic-Preserving Image-based Virtual Try-On Network , author=. Proceedings of the European Conference on Computer Vision (ECCV) , pages=

work page
[71]

2022 , journal=

Scalable Diffusion Models with Transformers , author=. 2022 , journal=

work page 2022
[72]

CoRR , year=

Auto-Encoding Variational Bayes , author=. CoRR , year=

work page
[73]

U-Net: Convolutional Networks for Biomedical Image Segmentation

Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas. U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2015. 2015

work page 2015
[74]

Neural Information Processing Systems , year=

Attention is All you Need , author=. Neural Information Processing Systems , year=

work page
[75]

In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Esser, Patrick and Rombach, Robin and Ommer, Bjorn , year=. Taming Transformers for High-Resolution Image Synthesis , url=. doi:10.1109/cvpr46437.2021.01268 , booktitle=

work page doi:10.1109/cvpr46437.2021.01268 2021
[76]

International conference on machine learning , pages=

Perceiver: General perception with iterative attention , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[77]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. J. Mach. Learn. Res. , year=

work page
[78]

arXiv preprint arXiv:2311.17117 , website=

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation , author=. arXiv preprint arXiv:2311.17117 , website=

work page arXiv
[79]

2024 , eprint=

StableGarment: Garment-Centric Generation via Stable Diffusion , author=. 2024 , eprint=

work page 2024
[80]

2025 , isbn =

Xu, Yuhao and Gu, Tao and Chen, Weifeng and Chen, Arlene , title =. 2025 , isbn =. doi:10.1609/aaai.v39i9.32973 , booktitle =

work page doi:10.1609/aaai.v39i9.32973 2025
