ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On

Hyeryung Jang; Junseo Park

arxiv: 2509.25749 · v2 · submitted 2025-09-30 · 💻 cs.CV · cs.AI

ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On

Junseo Park , Hyeryung Jang This is my paper

Pith reviewed 2026-05-18 13:21 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords virtual try-onlatent diffusion modelsmeasurement-guided samplingartifact-free generationboundary artifactsimage synthesisclothing transferinverse problem

0 comments

The pith

Reformulating virtual try-on as a linear inverse problem lets latent diffusion models eliminate boundary artifacts while preserving identity and background.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Virtual try-on requires swapping a garment onto a person while keeping the rest of the image unchanged, but diffusion models often create visible seams at the boundaries. The paper claims that casting the task as recovering an image from incomplete measurements lets the model gradually enforce consistency instead of patching regions after generation. To make this work without drift, it adds residual prior initialization to align training and inference, then runs a sampling loop that repeatedly applies data consistency, corrects frequencies, and falls back to ordinary denoising. If the approach holds, try-on results become seamless on high-resolution person images without post-processing or identity loss.

Core claim

ART-VITON reformulates VITON as a linear inverse problem and adopts trajectory-aligned solvers that progressively enforce measurement consistency. It further introduces residual prior-based initialization to reduce training-inference mismatch together with artifact-free measurement-guided sampling that interleaves data consistency, frequency-level correction, and periodic standard denoising.

What carries the argument

Artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising, initialized via residual prior to align training and inference trajectories.

If this is right

Non-try-on regions stay closer to the original image content without abrupt transitions.
Semantic drift is reduced during the diffusion trajectory compared with prior solvers.
Visual quality and robustness increase over baselines on VITON-HD, DressCode, and SHHQ-1.0.
No separate post-hoc replacement step is needed to hide boundary artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same measurement-consistency loop could be tested on other region-preserving edits such as face swapping or object insertion.
Frequency-level correction steps might stabilize long-range consistency in video try-on sequences.
The linear-inverse framing suggests trying the sampler on tasks where only partial image observations are available.

Load-bearing premise

Treating virtual try-on as a linear inverse problem and using trajectory-aligned solvers will enforce measurement consistency without causing semantic drift in non-try-on regions.

What would settle it

If visual inspection or metrics on the VITON-HD or DressCode test sets still show boundary seams or identity changes in preserved regions after applying ART-VITON, the claim that the guided sampling removes artifacts would be false.

Figures

Figures reproduced from arXiv: 2509.25749 by Hyeryung Jang, Junseo Park.

**Figure 2.** Figure 2: ART-VITON pipeline. (A) Residual prior-based initialization mitigates train-test mismatch. (B) Artifact-free measurement-guided inverse solver enforces measurements while preserving semantics: ⃝1 Tweedie estimation retains clothing details but lacks fidelity in non-try-on regions. ⃝2 Hard measurement constraints in pixel space correct preserved regions. High-frequency losses during ⃝3 VAE encoding are co… view at source ↗

**Figure 3.** Figure 3: Comparison of StableVITON baseline and inverse solvers on VITON-HD. (a) High [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of baseline models with and without our method across datasets. (a) On [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation study of pipeline components. Direct measurement enforcement increases arti [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative results of baseline models on the SHHQ-1.0 dataset. Our observations show [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: StableVITON on VITON-HD with inverse solvers applied without post-hoc replacement. [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison on the VITON-HD dataset with baseline (StableVITON) and existing in [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Additional qualitative results on the VITON-HD comparing baseline methods with our [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Extended comparison demonstrating robustness across domains on the SHHQ-1.0 [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Qualitative comparison of baseline VITON methods on DressCode dataset. Traditional [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

read the original abstract

Virtual try-on (VITON) aims to generate realistic images of a person wearing a target garment, requiring precise garment alignment in try-on regions and faithful preservation of identity and background in non-try-on regions. While latent diffusion models (LDMs) have advanced alignment and detail synthesis, preserving non-try-on regions remains challenging. A common post-hoc strategy directly replaces these regions with original content, but abrupt transitions often produce boundary artifacts. To overcome this, we reformulate VITON as a linear inverse problem and adopt trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. However, existing solvers still suffer from semantic drift during generation, leading to artifacts. We propose ART-VITON, a measurement-guided diffusion framework that ensures measurement adherence while maintaining artifact-free synthesis. Our method integrates residual prior-based initialization to mitigate training-inference mismatch and artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. Experiments on VITON-HD, DressCode, and SHHQ-1.0 demonstrate that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ART-VITON recasts virtual try-on as a linear inverse problem with a custom measurement-guided sampler to reduce boundary artifacts, but the abstract gives no numbers or ablations to show how well it works.

read the letter

The main thing to know is that this paper treats virtual try-on as a linear inverse problem and adds a measurement-guided diffusion sampler with residual prior initialization plus a mix of data consistency, frequency correction, and periodic denoising. The goal is to avoid the seams that come from simple post-hoc replacement of non-try-on regions while keeping identity and background intact. That combination of pieces looks new relative to the VITON papers cited in the abstract and targets a practical pain point in diffusion-based try-on for fashion applications. If the full results hold, it could be a useful engineering step for cleaner outputs on datasets like VITON-HD, DressCode, and SHHQ-1.0. The method earns credit for trying to enforce consistency progressively through trajectory-aligned solvers instead of relying on abrupt fixes, and the residual prior step directly addresses training-inference mismatch in a straightforward way. The thinking stays focused on the application constraints without claiming to rewrite diffusion theory. The soft spots are clear from the abstract alone. There are no quantitative tables, error bars, or ablation details shown, so the size of the claimed gains in visual fidelity and robustness is hard to judge. The linear inverse reformulation plus the added corrections may still leave room for semantic drift in garment fit and pose handling, since body-garment interactions are not strictly linear; the stress-test concern lands here and would need checking against the actual equations and failure cases in the manuscript. This is for CV researchers working on constrained image synthesis or practical virtual try-on systems. A reader who needs incremental sampling fixes for artifact issues would pick up usable ideas. The work shows coherent engagement with the problem and prior literature, so it deserves a serious referee to examine the experiments and the linear assumption in detail. I would send it to peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ART-VITON, a measurement-guided latent diffusion framework for virtual try-on. It reformulates VITON as a linear inverse problem and employs trajectory-aligned solvers with residual prior-based initialization plus a sampling procedure that interleaves data consistency, frequency-level correction, and periodic standard denoising. The central claim is that this eliminates boundary artifacts while preserving identity and background, yielding consistent visual improvements over baselines on VITON-HD, DressCode, and SHHQ-1.0.

Significance. If the quantitative and ablation evidence supports the claims, the work offers a principled integration of inverse-problem constraints into diffusion sampling for partial-preservation tasks. This could influence artifact mitigation strategies in constrained generative models beyond virtual try-on.

major comments (2)

[§4 Experiments] §4 Experiments and §3.3 Sampling: The abstract and method claim consistent improvements and artifact elimination, yet no quantitative tables, metrics, error bars, or ablation studies are referenced to substantiate the magnitude of gains or isolate the contribution of the measurement-guided components versus baselines.
[§3.1] §3.1 Inverse-problem reformulation: The central premise that a linear measurement operator (non-try-on regions) combined with trajectory-aligned solvers will enforce consistency without semantic drift in the try-on region lacks a derivation or analysis showing why the operator captures non-linear garment-body interactions; this assumption is load-bearing for the artifact-free claim.

minor comments (2)

[Abstract] Abstract: The phrase 'trajectory-aligned solvers' is introduced without a brief definition or citation; a short parenthetical explanation would improve accessibility.
[Figures] Figure captions: Several qualitative comparison figures would benefit from explicit call-outs or zoomed insets on boundary regions to visually demonstrate the claimed artifact reduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed feedback on our work. We have prepared point-by-point responses to the major comments and revised the manuscript to address the concerns raised regarding experimental validation and theoretical justification.

read point-by-point responses

Referee: [§4 Experiments] §4 Experiments and §3.3 Sampling: The abstract and method claim consistent improvements and artifact elimination, yet no quantitative tables, metrics, error bars, or ablation studies are referenced to substantiate the magnitude of gains or isolate the contribution of the measurement-guided components versus baselines.

Authors: We thank the referee for this observation. Section 4 of the manuscript does include quantitative comparisons using FID, LPIPS, and SSIM on VITON-HD, DressCode, and SHHQ-1.0 with visual results against baselines. However, to better isolate the contributions of residual prior initialization, data consistency, frequency correction, and periodic denoising, we have added a comprehensive ablation study (new Table 3) with error bars computed over three independent runs. This revision directly substantiates the magnitude of improvements and the role of each measurement-guided component. revision: yes
Referee: [§3.1] §3.1 Inverse-problem reformulation: The central premise that a linear measurement operator (non-try-on regions) combined with trajectory-aligned solvers will enforce consistency without semantic drift in the try-on region lacks a derivation or analysis showing why the operator captures non-linear garment-body interactions; this assumption is load-bearing for the artifact-free claim.

Authors: We agree that additional analysis strengthens the central claim. The linear measurement operator is explicitly the masking operator that preserves non-try-on pixels exactly while allowing free generation in try-on regions; it does not attempt to model non-linear garment-body interactions, which are instead handled by the underlying diffusion prior. The trajectory-aligned solver enforces consistency at each step to avoid boundary discontinuities. In the revised §3.1 we have added a short derivation showing how the combined data-consistency and frequency-correction steps bound semantic drift, supported by an analysis of the frequency spectrum at the garment boundary. This clarifies the artifact-free mechanism without overstating the linearity assumption. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper reformulates virtual try-on as a linear inverse problem and introduces novel components including residual prior-based initialization and measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. No equations, claims, or steps in the abstract or description reduce the proposed improvements or results to quantities defined by the authors' own prior fits, self-citations that carry the central argument, or self-definitional constructions. The derivation relies on new methodological contributions to address boundary artifacts and semantic drift rather than renaming or recycling fitted inputs as predictions, rendering the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract does not list explicit free parameters or invented entities; the approach relies on standard diffusion assumptions plus the unstated premise that measurement consistency can be enforced without semantic drift.

axioms (1)

domain assumption Latent diffusion models can be guided by measurement consistency in a linear inverse problem setting
Invoked when the abstract states that trajectory-aligned solvers progressively enforce measurement consistency

pith-pipeline@v0.9.0 · 5744 in / 1262 out tokens · 36910 ms · 2026-05-18T13:21:57.488957+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 3 internal anchors

[1]

Stylegan-human: A data-centric odyssey of human generation.arXiv preprint, arXiv:2204.11823,

Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Chen Qian, Chen-Change Loy, Wayne Wu, and Ziwei Liu. Stylegan-human: A data-centric odyssey of human generation.arXiv preprint, arXiv:2204.11823,

work page arXiv
[2]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Auto-Encoding Variational Bayes

URLhttps: //arxiv.org/abs/1312.6114. Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo. High-resolution virtual try-on with misalignment and occlusion-handled conditions. InEuropean Conference on Computer Vision, pp. 204–219. Springer,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Repaint: Inpainting using denoising diffusion probabilistic models, 2022a

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models, 2022a. URLhttps:// arxiv.org/abs/2201.09865. Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilisti...

work page arXiv
[5]

Vivat: Virtuous improving vae training through artifact mitigation.arXiv preprint arXiv:2506.07863,

Lev Novitskiy, Viacheslav Vasilev, Maria Kovaleva, Vladimir Arkhipkin, and Denis Dimitrov. Vivat: Virtuous improving vae training through artifact mitigation.arXiv preprint arXiv:2506.07863,

work page arXiv
[6]

Dreampaint: Few-shot inpainting of e-commerce items for virtual try-on without 3d modeling

Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, and Ismail B Tutar. Dreampaint: Few-shot inpainting of e-commerce items for virtual try-on without 3d modeling. arXiv preprint arXiv:2305.01257,

work page arXiv
[7]

Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, and Tao Mei

URLhttps: //arxiv.org/abs/2307.08123. Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, and Tao Mei. Improving virtual try-on with garment-focused diffusion models. InEuropean Conference on Computer Vision, pp. 184–199. Springer,

work page arXiv
[8]

Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization

Tao Yang, Rongyuan Wu, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. InThe European Conference on Computer Vision (ECCV) 2024,

work page 2024
[9]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Boow-vton: Boosting in-the-wild virtual try-on via mask-free pseudo data training

Xuanpu Zhang, Dan Song, Pengxin Zhan, Tianyu Chang, Jianhao Zeng, Qingguo Chen, Weihua Luo, and An-An Liu. Boow-vton: Boosting in-the-wild virtual try-on via mask-free pseudo data training. InProceedings of the Computer Vision and Pattern Recognition Conference, pp. 26399– 26408, 2025b. 13 A APPENDIX A.1 USE OF LARGE LANGUAGE MODELS We used a large langua...

work page 2025
[11]

(2022a).The deterministic DDIM sampling forms the basis for all inverse solvers

DDIM sampling Lugmayr et al. (2022a).The deterministic DDIM sampling forms the basis for all inverse solvers. Given a noisy latentz t at timestept, we first estimate the clean latent using Tweedie’s formula: ˆz(t) 0 = 1√¯αt zt − √ 1−¯αt ·ϵ θ(zt, t,c) .(8) The denoising step then updates the latent to timestept−1: zt−1 = √¯αt−1ˆz(t) 0 + p 1−¯αt−1ϵθ(zt, t,c...

work page 2022
[12]

FIG (Flow with Interpolant Guidance) Yan et al. (2025).By operating directly on the noisy latent, FIG performs gradient updates along the diffusion trajectory, preserving stability and sample diversity, whereas Tweedie-space optimization is more precise but incurs higher computational cost and reduces diversity. z′ t−1 =z t−1 −γ∇ zt−1 ∥¯yt−1 − ¯M⊙z t−1∥2 ...

work page 2025
[13]

˜ϵt := p 1−¯αt−1 −η 2β2 t ·ϵ θ +ηβ t ·ϵ√1−¯αt−1 ,ϵ∼ N(0,I),(16) whereηcontrols the noise level andβ t is the noise schedule

A.3.3 HYBRID STOCHASTIC METHODS These approaches combine deterministic updates with stochastic noise injection, where the degree of stochasticity is controlled throughηβ t, to balance measurement consistency and generation di- versity. ˜ϵt := p 1−¯αt−1 −η 2β2 t ·ϵ θ +ηβ t ·ϵ√1−¯αt−1 ,ϵ∼ N(0,I),(16) whereηcontrols the noise level andβ t is the noise schedu...

work page 2025

[1] [1]

Stylegan-human: A data-centric odyssey of human generation.arXiv preprint, arXiv:2204.11823,

Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Chen Qian, Chen-Change Loy, Wayne Wu, and Ziwei Liu. Stylegan-human: A data-centric odyssey of human generation.arXiv preprint, arXiv:2204.11823,

work page arXiv

[2] [2]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Auto-Encoding Variational Bayes

URLhttps: //arxiv.org/abs/1312.6114. Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo. High-resolution virtual try-on with misalignment and occlusion-handled conditions. InEuropean Conference on Computer Vision, pp. 204–219. Springer,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Repaint: Inpainting using denoising diffusion probabilistic models, 2022a

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models, 2022a. URLhttps:// arxiv.org/abs/2201.09865. Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilisti...

work page arXiv

[5] [5]

Vivat: Virtuous improving vae training through artifact mitigation.arXiv preprint arXiv:2506.07863,

Lev Novitskiy, Viacheslav Vasilev, Maria Kovaleva, Vladimir Arkhipkin, and Denis Dimitrov. Vivat: Virtuous improving vae training through artifact mitigation.arXiv preprint arXiv:2506.07863,

work page arXiv

[6] [6]

Dreampaint: Few-shot inpainting of e-commerce items for virtual try-on without 3d modeling

Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, and Ismail B Tutar. Dreampaint: Few-shot inpainting of e-commerce items for virtual try-on without 3d modeling. arXiv preprint arXiv:2305.01257,

work page arXiv

[7] [7]

Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, and Tao Mei

URLhttps: //arxiv.org/abs/2307.08123. Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, and Tao Mei. Improving virtual try-on with garment-focused diffusion models. InEuropean Conference on Computer Vision, pp. 184–199. Springer,

work page arXiv

[8] [8]

Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization

Tao Yang, Rongyuan Wu, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. InThe European Conference on Computer Vision (ECCV) 2024,

work page 2024

[9] [9]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Boow-vton: Boosting in-the-wild virtual try-on via mask-free pseudo data training

Xuanpu Zhang, Dan Song, Pengxin Zhan, Tianyu Chang, Jianhao Zeng, Qingguo Chen, Weihua Luo, and An-An Liu. Boow-vton: Boosting in-the-wild virtual try-on via mask-free pseudo data training. InProceedings of the Computer Vision and Pattern Recognition Conference, pp. 26399– 26408, 2025b. 13 A APPENDIX A.1 USE OF LARGE LANGUAGE MODELS We used a large langua...

work page 2025

[11] [11]

(2022a).The deterministic DDIM sampling forms the basis for all inverse solvers

DDIM sampling Lugmayr et al. (2022a).The deterministic DDIM sampling forms the basis for all inverse solvers. Given a noisy latentz t at timestept, we first estimate the clean latent using Tweedie’s formula: ˆz(t) 0 = 1√¯αt zt − √ 1−¯αt ·ϵ θ(zt, t,c) .(8) The denoising step then updates the latent to timestept−1: zt−1 = √¯αt−1ˆz(t) 0 + p 1−¯αt−1ϵθ(zt, t,c...

work page 2022

[12] [12]

FIG (Flow with Interpolant Guidance) Yan et al. (2025).By operating directly on the noisy latent, FIG performs gradient updates along the diffusion trajectory, preserving stability and sample diversity, whereas Tweedie-space optimization is more precise but incurs higher computational cost and reduces diversity. z′ t−1 =z t−1 −γ∇ zt−1 ∥¯yt−1 − ¯M⊙z t−1∥2 ...

work page 2025

[13] [13]

˜ϵt := p 1−¯αt−1 −η 2β2 t ·ϵ θ +ηβ t ·ϵ√1−¯αt−1 ,ϵ∼ N(0,I),(16) whereηcontrols the noise level andβ t is the noise schedule

A.3.3 HYBRID STOCHASTIC METHODS These approaches combine deterministic updates with stochastic noise injection, where the degree of stochasticity is controlled throughηβ t, to balance measurement consistency and generation di- versity. ˜ϵt := p 1−¯αt−1 −η 2β2 t ·ϵ θ +ηβ t ·ϵ√1−¯αt−1 ,ϵ∼ N(0,I),(16) whereηcontrols the noise level andβ t is the noise schedu...

work page 2025