One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion
Pith reviewed 2026-05-19 00:13 UTC · model grok-4.3
The pith
A single diffusion model unifies virtual try-on and try-off without masks or exhibition garments for arbitrary poses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OMFA is a unified diffusion framework built upon a Bidirectional Tweedie Diffusion process for target-selective denoising in latent space, enabling both try-on and try-off tasks without requiring exhibition garments or segmentation masks, while supporting arbitrary poses and multi-view synthesis from a single input image via SMPL-X-based pose conditioning.
What carries the argument
The Bidirectional Tweedie Diffusion process, which performs target-selective denoising inspired by discrete diffusion language models to unify try-on and try-off operations in latent space.
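To make the idea of target-selective denoising concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the linear beta schedule, the function names (`alpha_bar`, `selective_noising`, `tweedie_x0`), and the two-latent layout are assumptions made for illustration. It noises only the latent chosen as the target while the other latent stays clean as conditioning, then applies the standard Tweedie-style posterior-mean estimate x0 ≈ (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t). In a real model `eps_hat` would come from the denoising network; here the true noise is used as a stand-in.

```python
import numpy as np

T = 1000  # number of diffusion steps (assumed)

def alpha_bar(t):
    # Cumulative signal coefficient under an assumed linear beta schedule.
    betas = np.linspace(1e-4, 0.02, T)
    return np.cumprod(1.0 - betas)[t]

def add_noise(x0, t, rng):
    # Standard forward process: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps.
    ab = alpha_bar(t)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

def tweedie_x0(x_t, eps_hat, t):
    # Posterior-mean estimate of the clean latent given a noise prediction.
    ab = alpha_bar(t)
    return (x_t - np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(ab)

def selective_noising(person, garment, target, t, rng):
    # Noise only the target latent; the other one is returned untouched
    # so it can act as clean conditioning.
    if target == "person":    # try-on direction (assumed mapping)
        noisy, eps = add_noise(person, t, rng)
        return noisy, garment, eps
    if target == "garment":   # try-off direction (assumed mapping)
        noisy, eps = add_noise(garment, t, rng)
        return person, noisy, eps
    raise ValueError(f"unknown target: {target!r}")

rng = np.random.default_rng(0)
person = rng.standard_normal((4, 8, 8))   # toy person latent
garment = rng.standard_normal((4, 8, 8))  # toy garment latent

p_t, g_t, eps = selective_noising(person, garment, "person", 500, rng)
person_hat = tweedie_x0(p_t, eps, 500)    # oracle eps stands in for the network
```

Switching `target` to `"garment"` reverses the roles, which is the sense in which one set of weights could serve both directions. Whether this localizes edits well without masks is exactly what the referee's first major comment asks the authors to derive.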
If this is right
- Garment transfer works across different people without needing the original worn garment or any body segmentation.
- Outfit changes remain possible even when the target pose differs from the reference portrait.
- A single trained model handles both adding clothes and removing them instead of separate systems.
- Multi-view and arbitrary-pose outputs are produced directly from one image plus pose parameters.
Where Pith is reading between the lines
- The same selective denoising idea could apply to other image editing tasks such as object replacement or background removal.
- Training data needs might decrease because paired exhibition-garment examples are no longer required.
- Extending the process to short video sequences could enable consistent outfit changes across motion.
Load-bearing premise
The bidirectional diffusion process can reliably add or remove specific garments while keeping body shape and pose consistent without any explicit masks or reference clothing images.
What would settle it
Generate try-on results for input pairs where the person and garment come from very different body types or lighting conditions and check whether garment boundaries and textures remain accurate without visible seams or distortions.
Original abstract
Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios; for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. OMFA is inspired by the mask-based paradigm of discrete diffusion language models and unifies try-on and try-off within a bidirectional framework. It is built upon a Bidirectional Tweedie Diffusion process for target-selective denoising in latent space. Instead of imposing lower body constraints, OMFA is an entirely mask-free framework that requires only a single portrait and a target garment as inputs, and is designed to support flexible outfit combinations and cross-person garment transfer, making it better aligned with practical usage scenarios. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. Project page: https://onemodelforall.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OMFA, a unified diffusion framework for virtual try-on and try-off that operates without exhibition garments or segmentation masks. It employs a Bidirectional Tweedie Diffusion process inspired by discrete diffusion language models, with SMPL-X-based pose conditioning to enable arbitrary-pose and multi-view synthesis from a single portrait and target garment. The authors claim this mask-free approach supports flexible outfit combinations and cross-person transfers while achieving state-of-the-art results on both tasks.
Significance. If validated, the unification of try-on and try-off in a single mask-free model with arbitrary-pose support would be a notable advance for practical virtual garment synthesis, addressing real-world constraints like fixed poses and mask requirements that limit current methods. The LLM-inspired bidirectional mechanism and SMPL-X conditioning represent a promising direction for implicit target-selective editing in latent space.
major comments (2)
- [§3.2] §3.2 (Bidirectional Tweedie Diffusion formulation): The description of target-selective denoising relies on an implicit inductive bias from the bidirectional schedule, but no equation or derivation shows how the forward/reverse processes localize garment changes without explicit masking or exhibition garments. This mechanism is load-bearing for the central unification claim, as failure here would produce blending or identity leakage in cross-person cases.
- [§5.2] §5.2 (Quantitative results on arbitrary-pose try-on): The reported metrics for pose variation and cross-person transfer lack ablation on the SMPL-X conditioning strength and comparison to mask-free baselines under large body-shape differences; without these, the SOTA claim for generalizability cannot be fully assessed.
minor comments (2)
- [Abstract] The abstract states 'extensive experiments demonstrate SOTA' but omits dataset names, metric values, or error bars; moving a concise summary of key numbers to the abstract would improve accessibility.
- [§3] Notation for the Tweedie process parameters (e.g., the bidirectional noise schedule) is introduced without a dedicated table of symbols, making cross-references between equations harder to follow.
Simulated Author's Rebuttal
Thank you for the constructive review. We address the major comments below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§3.2] §3.2 (Bidirectional Tweedie Diffusion formulation): The description of target-selective denoising relies on an implicit inductive bias from the bidirectional schedule, but no equation or derivation shows how the forward/reverse processes localize garment changes without explicit masking or exhibition garments. This mechanism is load-bearing for the central unification claim, as failure here would produce blending or identity leakage in cross-person cases.
  Authors: We appreciate the referee pointing out the need for a clearer mathematical justification of the target-selective denoising mechanism. The Bidirectional Tweedie Diffusion draws from the inductive bias in discrete diffusion models for language, where the noising schedule and conditioning allow selective reconstruction. To address this, we will add a detailed derivation in the revised §3.2 and an appendix, explicitly showing the forward and reverse processes and how they achieve localization in latent space without masks. This will also include analysis of potential identity leakage.
  Revision: yes
- Referee: [§5.2] §5.2 (Quantitative results on arbitrary-pose try-on): The reported metrics for pose variation and cross-person transfer lack ablation on the SMPL-X conditioning strength and comparison to mask-free baselines under large body-shape differences; without these, the SOTA claim for generalizability cannot be fully assessed.
  Authors: We agree that further ablations are necessary to substantiate the claims regarding generalizability across poses and body shapes. In the revised manuscript, we will expand §5.2 with an ablation on SMPL-X conditioning strength (varying the guidance scale) and additional comparisons against mask-free baselines on subsets with large body-shape discrepancies. These experiments will be conducted and reported to better support the SOTA results.
  Revision: yes
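The conditioning-strength ablation described in the second response could look roughly like the sweep below. This sketch assumes that "guidance scale" refers to classifier-free guidance applied to the SMPL-X pose condition, which neither the abstract nor the rebuttal confirms. The names `guided_eps` and `sweep_guidance` are hypothetical, and random arrays stand in for the network's conditional and unconditional noise predictions.

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the pose-conditioned one.
    return eps_uncond + scale * (eps_cond - eps_uncond)

def sweep_guidance(eps_uncond, eps_cond, scales):
    # One guided prediction per scale, keyed by the scale value.
    return {s: guided_eps(eps_uncond, eps_cond, s) for s in scales}

rng = np.random.default_rng(0)
eps_u = rng.standard_normal((4, 8, 8))  # stand-in: prediction without pose
eps_c = rng.standard_normal((4, 8, 8))  # stand-in: prediction with pose

results = sweep_guidance(eps_u, eps_c, scales=[0.0, 1.0, 3.0, 7.5])
```

A scale of 0 ignores the pose condition, 1 reproduces the conditional prediction, and larger values push further toward it. An ablation would report try-on metrics at each setting, ideally split by body-shape difference as the referee requests.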
Circularity Check
No significant circularity; derivation builds on external diffusion concepts
Full rationale
The paper's core derivation introduces OMFA as a bidirectional Tweedie diffusion framework inspired by discrete diffusion language models for mask-free try-on/try-off unification. No equations, self-citations, or fitted parameters are presented that reduce the target-selective denoising claim to a tautology or prior self-result by construction. The bidirectional process and SMPL-X conditioning are positioned as architectural extensions of standard latent diffusion, with performance claims resting on experiments rather than self-referential definitions or renamed inputs. This is a standard non-circular presentation of a new model.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Bidirectional Tweedie Diffusion enables target-selective denoising in latent space to unify try-on and try-off tasks.
- domain assumption: SMPL-X-based pose conditioning supports multi-view and arbitrary-pose generation from a single image.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: OMFA is built upon a Bidirectional Tweedie Diffusion process for target-selective denoising in latent space... partial diffusion mechanism that selectively applies noise and denoising to individual components
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: unifies try-on and try-off within a bidirectional framework... mask-free... SMPL-X-based pose conditioning
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off
  Dress-ED is the first large-scale benchmark unifying virtual try-on, try-off, and text-guided garment editing with 146k verified samples plus a multimodal diffusion baseline.