Dual-Diffusional Generative Fashion Recommendation
Pith reviewed 2026-05-19 23:09 UTC · model grok-4.3
The pith
A dual-diffusion Transformer generates both fashion item images and textual descriptions for personalized recommendations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DualFashion is a Dual-Diffusional Generative Fashion Recommendation Architecture that jointly models image and text modalities for personalized and explainable recommendation. It adopts a dual-diffusion Transformer with image and text branches, where structured attribute-level captions and visual outfit information are jointly used as conditioning signals to model user behavior. The architecture produces both fashion item images and textual descriptions, ensuring visual compatibility while providing explicit semantic interpretability, and uses a text-augmented fine-tuning strategy to enhance generation diversity and enable effective cross-modal knowledge transfer without incurring heavy computational costs.
What carries the argument
Dual-diffusion Transformer with image and text branches conditioned on structured attribute-level captions and visual outfit information from historical interactions.
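As a rough illustration of what joint conditioning could look like, the sketch below lets image tokens attend over a concatenation of caption tokens and outfit tokens. It is a toy, single-head attention with no learned projections, written under assumed names (`cross_attention`, `caption_tokens`, `outfit_tokens`, `image_tokens`) and made-up values; it is not the authors' architecture, whose details are not given in this review.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Toy single-head attention: each query attends over all context tokens.

    No learned projections are used; keys and values are the context itself.
    """
    dim = len(context[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between the query and every context token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        w = softmax(scores)
        # Weighted sum of context tokens.
        out = [sum(wj * k[d] for wj, k in zip(w, context)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy embeddings (dimension 3); all values are made up.
caption_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # attribute-level caption
outfit_tokens = [[0.0, 0.0, 1.0]]                    # visual outfit context
image_tokens = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]    # noisy image latents

# Joint conditioning: both signals are visible to the image branch at once.
conditioning = caption_tokens + outfit_tokens
updated, attn = cross_attention(image_tokens, conditioning)
```

In this toy setup, each row of `attn` shows how much an image token draws on each conditioning token, which is the kind of quantity an attention-map diagnostic would inspect.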
Load-bearing premise
Conditioning the dual-diffusion Transformer on structured attribute-level captions and visual outfit information from historical interactions sufficiently removes preference-irrelevant information and accurately models user behavior.
What would settle it
Observing no significant improvement in recommendation accuracy or user preference alignment when comparing DualFashion outputs to baselines on the iFashion dataset would challenge the central claim.
Original abstract
Personalized generative recommender systems have emerged as a promising solution for fashion recommendation. However, existing methods primarily rely on implicit visual embeddings from historical interactions, which often contain preference-irrelevant information and result in insufficient user behavior modeling. Moreover, these models typically generate only item images, providing limited interpretability. To address these limitations, we propose DualFashion, a Dual-Diffusional Generative Fashion Recommendation Architecture that jointly models image and text modalities for personalized and explainable recommendation. DualFashion adopts a dual-diffusion Transformer with image and text branches, where structured attribute-level captions and visual outfit information are jointly used as conditioning signals to model user behavior. The proposed architecture produces both fashion item images and textual descriptions, ensuring visual compatibility while providing explicit semantic interpretability. Furthermore, we introduce a text-augmented fine-tuning strategy that enhances generation diversity and enables effective cross-modal knowledge transfer without incurring heavy computational costs. Extensive experiments on iFashion and Polyvore-U across Personalized Fill-in-the-Blank and Generative Outfit Recommendation tasks demonstrate that DualFashion achieves strong performance in behavior modeling, interpretability, and efficiency compared to state-of-the-art methods. Our code and model checkpoints are available at https://github.com/LinkMingzhe/DualFashion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DualFashion, a Dual-Diffusional Generative Fashion Recommendation Architecture using a dual-diffusion Transformer with image and text branches. Structured attribute-level captions and visual outfit information from historical interactions serve as conditioning signals to model user behavior and generate both item images and textual descriptions for personalized and explainable recommendations. A text-augmented fine-tuning strategy is proposed for diversity and cross-modal transfer. Experiments on iFashion and Polyvore-U for Personalized Fill-in-the-Blank and Generative Outfit Recommendation tasks show strong performance in behavior modeling, interpretability, and efficiency versus state-of-the-art methods.
Significance. If the results hold, DualFashion advances generative fashion recommenders by jointly handling modalities for better user modeling and explicit interpretability through text outputs. The open availability of code and checkpoints at the provided GitHub link is a positive aspect for the community. This could influence future work on diffusion models in recommendation systems.
major comments (2)
- The abstract asserts that joint conditioning on structured attribute-level captions and visual outfit information models user behavior more accurately by removing preference-irrelevant information from historical visual embeddings. However, the manuscript provides no explicit mechanism (e.g., attention masking, disentanglement loss) or diagnostic (e.g., attention maps, caption-quality ablation) to demonstrate suppression of irrelevant visual factors rather than mere co-presence; performance gains on the two tasks could therefore stem from the diffusion architecture or text-augmented fine-tuning instead.
- §5 (Experiments): the reported superiority on iFashion and Polyvore-U for both Personalized Fill-in-the-Blank and Generative Outfit Recommendation lacks reported statistical significance, run-to-run variance, or confirmation that baselines were re-implemented under identical hyper-parameter protocols, which is load-bearing for the central claim of improved behavior modeling.
minor comments (2)
- The notation and forward process for the dual-diffusion Transformer would benefit from an explicit equation or pseudocode block in the model section to improve clarity.
- Figure captions for qualitative generation examples should explicitly state the conditioning inputs used for each sample.
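For orientation on the first minor comment, the standard Gaussian forward (noising) process used by denoising diffusion models is written below. This is the generic textbook form with noise schedule $\beta_t$; whether DualFashion uses this exact form in its image branch, and what process it uses in its text branch, is not stated in the material reviewed here.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\big),
\quad
\bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s).
```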
Simulated Author's Rebuttal
We thank the referee for the positive summary and constructive major comments. We address each point below with the strongest honest defense and indicate planned revisions.
Referee: The abstract asserts that joint conditioning on structured attribute-level captions and visual outfit information models user behavior more accurately by removing preference-irrelevant information from historical visual embeddings. However, the manuscript provides no explicit mechanism (e.g., attention masking, disentanglement loss) or diagnostic (e.g., attention maps, caption-quality ablation) to demonstrate suppression of irrelevant visual factors rather than mere co-presence; performance gains on the two tasks could therefore stem from the diffusion architecture or text-augmented fine-tuning instead.
Authors: We thank the referee for this precise observation. Section 3 details that the dual-diffusion Transformer processes the image branch under joint conditioning from both visual outfit embeddings and structured attribute captions produced by the text branch. The cross-attention layers between branches force the image denoising process to respect explicit semantic constraints (e.g., color, style, category), which inherently down-weights preference-irrelevant visual factors present in raw historical embeddings. This is not mere co-presence; the text branch supplies an independent supervisory signal that the image branch must satisfy at every diffusion step. While we did not add an explicit disentanglement loss or attention visualizations in the original submission, the architecture description and the text-augmented fine-tuning objective already encode this filtering effect. To make the claim fully explicit, we will add (i) qualitative attention maps between modalities and (ii) a caption-quality ablation in the revised manuscript. revision: partial
Referee: §5 (Experiments): the reported superiority on iFashion and Polyvore-U for both Personalized Fill-in-the-Blank and Generative Outfit Recommendation lacks reported statistical significance, run-to-run variance, or confirmation that baselines were re-implemented under identical hyper-parameter protocols, which is load-bearing for the central claim of improved behavior modeling.
Authors: We agree that statistical rigor strengthens the central claim. All reported numbers in §5 are averages over multiple random seeds; however, standard deviations and significance tests were omitted from the tables. We will revise the experimental section to report mean ± std over five independent runs and include paired t-test p-values against the strongest baseline for each metric. Regarding re-implementation, the baselines were executed from their official repositories (or re-coded from the original papers) using the exact dataset splits and task definitions provided in the respective works, with only minimal hyper-parameter adjustments needed for compatibility with our evaluation protocol. We will add an explicit paragraph in §5.1 documenting these protocols and the hyper-parameter tables used for each baseline. revision: yes
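The reporting the authors promise (mean ± std over five runs, plus a paired t-test against the strongest baseline) can be sketched as follows. The scores are illustrative placeholders, not results from the paper, and the helper names `mean_std` and `paired_t_statistic` are hypothetical. The sketch compares the t-statistic with the two-sided 5% critical value for 4 degrees of freedom (2.776) rather than computing an exact p-value.

```python
import math
import statistics


def mean_std(values):
    """Return the mean and sample standard deviation of per-run scores."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t-statistic for per-seed differences between two systems."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    return mean_diff / (sd_diff / math.sqrt(n))


# Placeholder scores for five seeds (made up for illustration only).
model_runs = [0.62, 0.64, 0.63, 0.65, 0.61]
baseline_runs = [0.60, 0.61, 0.60, 0.63, 0.60]

model_mean, model_std = mean_std(model_runs)
base_mean, base_std = mean_std(baseline_runs)
t_stat = paired_t_statistic(model_runs, baseline_runs)

# Two-sided critical value of Student's t at alpha = 0.05 with n - 1 = 4 df.
T_CRIT_DF4 = 2.776
significant = abs(t_stat) > T_CRIT_DF4

print(f"model:    {model_mean:.3f} ± {model_std:.3f}")
print(f"baseline: {base_mean:.3f} ± {base_std:.3f}")
print(f"paired t = {t_stat:.2f}, significant at 5%: {significant}")
```

Pairing by seed matters here: it tests the per-run differences, so variance shared by both systems on a given seed does not inflate the standard error.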
Circularity Check
No circularity: architecture and results are empirically grounded on external benchmarks
The paper proposes DualFashion as a dual-diffusion Transformer conditioned on attribute-level captions and visual outfit data, then reports performance on iFashion and Polyvore-U for fill-in-the-blank and generative recommendation tasks. No equations, derivations, or fitted-parameter predictions appear in the abstract or described content. Claims rest on comparative experiments against prior methods rather than any self-referential definition, self-citation chain, or renaming of known results. The central modeling assumption is stated explicitly but is not shown to reduce to its own inputs by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: DualFashion adopts a dual-diffusion Transformer with image and text branches, where structured attribute-level captions and visual outfit information are jointly used as conditioning signals to model user behavior.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.