One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion

Guanbin Li; Guangrun Wang; Jinxi Liu; Liang Lin; Zijian He

arxiv: 2508.04559 · v3 · submitted 2025-08-06 · 💻 cs.CV

Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin

Pith reviewed 2026-05-19 00:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords virtual try-on · virtual try-off · diffusion models · garment synthesis · pose conditioning · image editing · mask-free synthesis

The pith

A single diffusion model unifies virtual try-on and try-off without masks or exhibition garments for arbitrary poses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OMFA, a unified framework that performs both adding and removing garments from a single person portrait using only a target garment image as input. It adapts the mask-based denoising from discrete diffusion language models into a bidirectional Tweedie process that selectively edits clothing in latent space. This removes the usual requirements for segmentation masks and reference garments shown on a model, while adding SMPL-X pose data to handle any viewpoint or body position. If the approach holds, virtual fitting becomes practical for cross-person transfers and flexible outfit changes from everyday photos.

Core claim

OMFA is a unified diffusion framework built upon a Bidirectional Tweedie Diffusion process for target-selective denoising in latent space, enabling both try-on and try-off tasks without requiring exhibition garments or segmentation masks, while supporting arbitrary poses and multi-view synthesis from a single input image via SMPL-X-based pose conditioning.
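For orientation, the "Tweedie" part of the name most likely refers to the standard Tweedie posterior-mean estimate used throughout diffusion modelling. The block below states that textbook identity under the usual variance-preserving forward process; the symbols $\bar\alpha_t$, $\epsilon_\theta$, and $\hat{x}_0$ are conventional notation assumed here, not taken from the paper, and the paper's bidirectional variant may differ.

```latex
% Assumed forward process (standard DDPM convention, not quoted from the paper):
%   x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,
%   \epsilon \sim \mathcal{N}(0, I)
\begin{align}
  \mathbb{E}[x_0 \mid x_t]
    &= \frac{x_t + (1-\bar\alpha_t)\,\nabla_{x_t} \log p_t(x_t)}{\sqrt{\bar\alpha_t}}, \\
  \hat{x}_0
    &= \frac{x_t - \sqrt{1-\bar\alpha_t}\;\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}},
  \qquad
  \epsilon_\theta(x_t, t) \approx -\sqrt{1-\bar\alpha_t}\,\nabla_{x_t} \log p_t(x_t).
\end{align}
```

The second line is the first rewritten in terms of a noise-prediction network, which is the form most latent diffusion implementations use.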

What carries the argument

The Bidirectional Tweedie Diffusion process, which performs target-selective denoising inspired by discrete diffusion language models to unify try-on and try-off operations in latent space.
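One way to picture target-selective denoising is as noise applied to only one component of a joint latent, with the other component left clean as conditioning. The following is a minimal NumPy sketch under that assumption. The component names `dressed_person` and `garment`, the helper functions, and the use of the true noise in place of a trained network are all hypothetical; this illustrates the general idea, not the authors' implementation.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar):
    """Standard forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def tweedie_x0(x_t, eps_hat, alpha_bar):
    """Tweedie-style estimate of the clean latent from x_t and a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)

def selective_noise(latents, target, alpha_bar, rng):
    """Noise only the `target` component; every other component stays clean."""
    noised = dict(latents)  # shallow copy, untouched entries are shared
    eps = rng.standard_normal(latents[target].shape)
    noised[target] = add_noise(latents[target], eps, alpha_bar)
    return noised, eps

rng = np.random.default_rng(0)
latents = {
    "dressed_person": rng.standard_normal((4, 8, 8)),  # hypothetical latent shapes
    "garment": rng.standard_normal((4, 8, 8)),
}

# Try-on direction (assumed): the person latent is generated, the garment is given.
try_on, eps_on = selective_noise(latents, "dressed_person", 0.5, rng)
# Try-off direction (assumed): the garment latent is generated, the person is given.
try_off, eps_off = selective_noise(latents, "garment", 0.5, rng)

# With an oracle noise prediction, the Tweedie estimate recovers the clean target.
recovered = tweedie_x0(try_on["dressed_person"], eps_on, 0.5)
```

In a real system the oracle noise would be replaced by a network that sees both components, so the clean one can guide reconstruction of the noised one; swapping which component is noised is what would make a single model serve both directions.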

If this is right

Garment transfer works across different people without needing the original worn garment or any body segmentation.
Outfit changes remain possible even when the target pose differs from the reference portrait.
A single trained model handles both adding clothes and removing them instead of separate systems.
Multi-view and arbitrary-pose outputs are produced directly from one image plus pose parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same selective denoising idea could apply to other image editing tasks such as object replacement or background removal.
Training data needs might decrease because paired exhibition-garment examples are no longer required.
Extending the process to short video sequences could enable consistent outfit changes across motion.

Load-bearing premise

The bidirectional diffusion process can reliably add or remove specific garments while keeping body shape and pose consistent without any explicit masks or reference clothing images.

What would settle it

Generate try-on results for input pairs where the person and garment come from very different body types or lighting conditions and check whether garment boundaries and textures remain accurate without visible seams or distortions.

Original abstract

Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios; for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. OMFA is inspired by the mask-based paradigm of discrete diffusion language models and unifies try-on and try-off within a bidirectional framework. It is built upon a Bidirectional Tweedie Diffusion process for target-selective denoising in latent space. Instead of imposing lower body constraints, OMFA is an entirely mask-free framework that requires only a single portrait and a target garment as inputs, and is designed to support flexible outfit combinations and cross-person garment transfer, making it better aligned with practical usage scenarios. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. Project page: https://onemodelforall.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OMFA unifies try-on and try-off in one mask-free bidirectional diffusion model with arbitrary-pose support via SMPL-X, but the implicit garment localization still needs stronger checks.

OMFA unifies virtual try-on and try-off into a single model using a bidirectional Tweedie diffusion process. This is the core advance: instead of separate pipelines or mask-dependent methods, it does both directions mask-free from just a portrait and target garment, with pose conditioning for flexibility. They draw from discrete diffusion ideas in language models to create target-selective denoising in continuous latent space. The SMPL-X integration allows multi-view and pose changes from one image, which addresses a clear pain point in prior work that stuck to reference poses.

The work does a solid job identifying the limitations of existing methods, like pose restrictions and mask requirements, and proposes a framework that directly tackles them. If the quantitative results back up the SOTA claims on standard benchmarks, that would be a useful step forward in making these tools more deployable.

The potential weak point is whether the implicit target-selective denoising really works reliably. Without explicit masks, the model has to learn from data to change only the relevant clothing areas while keeping identity, background, and pose intact. When source and target differ a lot in shape or style, this could produce artifacts like blending or wrong garment placement. The stress-test note raises this, and I'd check the paper's ablations or failure cases to see how they address it. Also, since the abstract lacks numbers, the full experiments section needs to show clear improvements with error bars or statistical significance.

Overall, this is for people in the virtual try-on and digital fashion area of computer vision. A reader looking for practical extensions of diffusion models to editing tasks would get something out of it. The combination of unification and pose flexibility makes it worth a serious look from referees, even if some details on the diffusion schedule need tightening. I'd recommend sending it for peer review. The ideas are grounded enough to warrant feedback on the implementation and results.

Referee Report

2 major / 2 minor

Summary. The paper introduces OMFA, a unified diffusion framework for virtual try-on and try-off that operates without exhibition garments or segmentation masks. It employs a Bidirectional Tweedie Diffusion process inspired by discrete diffusion language models, with SMPL-X-based pose conditioning to enable arbitrary-pose and multi-view synthesis from a single portrait and target garment. The authors claim this mask-free approach supports flexible outfit combinations and cross-person transfers while achieving state-of-the-art results on both tasks.

Significance. If validated, the unification of try-on and try-off in a single mask-free model with arbitrary-pose support would be a notable advance for practical virtual garment synthesis, addressing real-world constraints like fixed poses and mask requirements that limit current methods. The LLM-inspired bidirectional mechanism and SMPL-X conditioning represent a promising direction for implicit target-selective editing in latent space.

major comments (2)

[§3.2] §3.2 (Bidirectional Tweedie Diffusion formulation): The description of target-selective denoising relies on an implicit inductive bias from the bidirectional schedule, but no equation or derivation shows how the forward/reverse processes localize garment changes without explicit masking or exhibition garments. This mechanism is load-bearing for the central unification claim, as failure here would produce blending or identity leakage in cross-person cases.
[§5.2] §5.2 (Quantitative results on arbitrary-pose try-on): The reported metrics for pose variation and cross-person transfer lack ablation on the SMPL-X conditioning strength and comparison to mask-free baselines under large body-shape differences; without these, the SOTA claim for generalizability cannot be fully assessed.

minor comments (2)

[Abstract] The abstract states 'extensive experiments demonstrate SOTA' but omits dataset names, metric values, or error bars; moving a concise summary of key numbers to the abstract would improve accessibility.
[§3] Notation for the Tweedie process parameters (e.g., the bidirectional noise schedule) is introduced without a dedicated table of symbols, making cross-references between equations harder to follow.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review. We address the major comments below and will incorporate revisions to strengthen the manuscript.

Referee: [§3.2] §3.2 (Bidirectional Tweedie Diffusion formulation): The description of target-selective denoising relies on an implicit inductive bias from the bidirectional schedule, but no equation or derivation shows how the forward/reverse processes localize garment changes without explicit masking or exhibition garments. This mechanism is load-bearing for the central unification claim, as failure here would produce blending or identity leakage in cross-person cases.

Authors: We appreciate the referee pointing out the need for a clearer mathematical justification of the target-selective denoising mechanism. The Bidirectional Tweedie Diffusion draws from the inductive bias in discrete diffusion models for language, where the noising schedule and conditioning allow selective reconstruction. To address this, we will add a detailed derivation in the revised §3.2 and an appendix, explicitly showing the forward and reverse processes and how they achieve localization in latent space without masks. This will also include analysis of potential identity leakage.
revision: yes

Referee: [§5.2] §5.2 (Quantitative results on arbitrary-pose try-on): The reported metrics for pose variation and cross-person transfer lack ablation on the SMPL-X conditioning strength and comparison to mask-free baselines under large body-shape differences; without these, the SOTA claim for generalizability cannot be fully assessed.

Authors: We agree that further ablations are necessary to substantiate the claims regarding generalizability across poses and body shapes. In the revised manuscript, we will expand §5.2 with an ablation on SMPL-X conditioning strength (varying the guidance scale) and additional comparisons against mask-free baselines on subsets with large body-shape discrepancies. These experiments will be conducted and reported to better support the SOTA results.
revision: yes
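If "guidance scale" here means classifier-free guidance, which the text does not confirm, the proposed ablation would amount to sweeping a single scalar that blends pose-conditioned and unconditioned noise predictions. The sketch below shows that blend with NumPy; the two prediction arrays are random stand-ins for network outputs, and the function name and scale values are hypothetical.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past, for scale > 1) the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 8, 8))  # stand-in: prediction without pose input
eps_cond = rng.standard_normal((4, 8, 8))    # stand-in: prediction with SMPL-X pose input

# Sweep of guidance scales, as an ablation over conditioning strength might do.
sweep = {s: guided_noise(eps_uncond, eps_cond, s) for s in (0.0, 1.0, 2.5, 5.0)}
```

A scale of 0 ignores the pose condition, 1 uses the conditional prediction as-is, and larger values push harder toward the pose signal, typically at some cost in image fidelity.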

Circularity Check

0 steps flagged

No significant circularity; derivation builds on external diffusion concepts

The paper's core derivation introduces OMFA as a bidirectional Tweedie diffusion framework inspired by discrete diffusion language models for mask-free try-on/try-off unification. No equations, self-citations, or fitted parameters are presented that reduce the target-selective denoising claim to a tautology or prior self-result by construction. The bidirectional process and SMPL-X conditioning are positioned as architectural extensions of standard latent diffusion, with performance claims resting on experiments rather than self-referential definitions or renamed inputs. This is a standard non-circular presentation of a new model.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from diffusion models and pose estimation, plus the novel unification via bidirectional process; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Bidirectional Tweedie Diffusion enables target-selective denoising in latent space to unify try-on and try-off tasks.
Invoked as the foundational mechanism for the mask-free bidirectional framework in the abstract.

domain assumption: SMPL-X-based pose conditioning supports multi-view and arbitrary-pose generation from a single image.
Stated as enabling flexible outfit combinations and cross-person transfers.

pith-pipeline@v0.9.0 · 5827 in / 1457 out tokens · 54908 ms · 2026-05-19T00:13:46.586566+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "OMFA is built upon a Bidirectional Tweedie Diffusion process for target-selective denoising in latent space... partial diffusion mechanism that selectively applies noise and denoising to individual components"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "unifies try-on and try-off within a bidirectional framework... mask-free... SMPL-X-based pose conditioning"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off
cs.CV · 2026-03 · unverdicted · novelty 7.0

Dress-ED is the first large-scale benchmark unifying virtual try-on, try-off, and text-guided garment editing with 146k verified samples plus a multimodal diffusion baseline.