DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation

Boheng Zhang; Dewen Fan; Fan Yang; Honglie Wang; Huaiqing Wang; Huan Ouyang; Jia Sun; Jiyuan Wang; Junlong Wu; Pengcheng Wei

arxiv: 2607.02220 · v1 · pith:QRDI7QYPnew · submitted 2026-07-02 · 💻 cs.CV

DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation

Zijun Li , Yimin Zhou , Jia Sun , Honglie Wang , Pengcheng Wei , Junlong Wu , Yongrui Heng , Jiyuan Wang

show 7 more authors

Huan Ouyang Boheng Zhang Huaiqing Wang Dewen Fan Qianqian Gan Fan Yang Tingting Gao

This is my paper

Pith reviewed 2026-07-03 15:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords fashion detail generationcross-modal feature alignmentdiffusion transformerDINOv3 distillationfocus conditioningconsistency reward modelvirtual try-on

0 comments

The pith

DetailAnywhere generates photorealistic fashion details from focus markers on garment images via cross-modal distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Fashion Detail Generation as a new task where a model must produce a close-up view of a specific garment region, such as a collar or cuff, indicated only by a focus marker on a reference image. It releases FDBench, a benchmark of over 40,000 human-verified reference-detail pairs across 41 categories, to measure performance on this setting. The core technique, Cross-Modal Feature Alignment Distillation, uses a fine-tuned DINOv3 teacher to align the branches of a Multimodal Diffusion Transformer in a shared semantic space. A consistency reward model further guides reinforcement learning to maintain garment identity. The resulting DetailAnywhere model outperforms prior open-source methods on standard metrics and in human evaluations.

Core claim

Cross-Modal Feature Alignment Distillation aligns both branches of a Multimodal Diffusion Transformer in a shared semantic space using a fine-tuned DINOv3 teacher, enabling the model to bridge the correspondence between a focus marker and the corresponding photorealistic close-up while preserving garment identity without any precise prompt.

What carries the argument

Cross-Modal Feature Alignment Distillation (CFAD), which performs dual-branch distillation from a DINOv3 teacher into a Multimodal Diffusion Transformer.

If this is right

E-commerce platforms can let users request and receive close-up views of specific garment features directly from product photos.
Generated details remain consistent with the reference image's identity even without text descriptions of the detail.
The consistency reward model provides a way to optimize diffusion outputs for paired image quality along multiple axes.
Non-template detail synthesis becomes feasible for apparel visualization tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment strategy might apply to detail generation in other domains such as product design or interior visualization.
FDBench could serve as a testbed for studying focus-conditioned generation beyond fashion.
The dual-branch distillation might combine with other teacher models to handle additional conditioning signals.

Load-bearing premise

A fine-tuned DINOv3 teacher can align both branches of a Multimodal Diffusion Transformer in a shared semantic space to bridge the correspondence between a focus marker and a photorealistic close-up while preserving garment identity without precise prompts.

What would settle it

An experiment in which DetailAnywhere fails to outperform state-of-the-art open-source methods on FDBench quantitative metrics or in human evaluations would show the distillation approach does not solve the semantic gap.

Figures

Figures reproduced from arXiv: 2607.02220 by Boheng Zhang, Dewen Fan, Fan Yang, Honglie Wang, Huaiqing Wang, Huan Ouyang, Jia Sun, Jiyuan Wang, Junlong Wu, Pengcheng Wei, Qianqian Gan, Tingting Gao, Yimin Zhou, Yongrui Heng, Zijun Li.

**Figure 1.** Figure 1: Fashion Detail Generation. Given a reference garment image with a bounding box indicating the target region, our model generates a high-fidelity, identity-consistent close-up of the indicated area. Abstract Diffusion-based generative AI has achieved remarkable success in e-commerce applications such as virtual try-on, poster generation, and product background synthesis. However, when making online purchas… view at source ↗

**Figure 2.** Figure 2: Overview of DetailAnywhere. Our framework consists of two stages: (1) CFAD trains a view-bridging teacher to align reference and detail representations, then distills this alignment into both branches of the generator; (2) A consistency reward model fine-tunes the generator via Negative-aware FineTuning to further improve identity preservation and detail fidelity. EditScore [57] focus specifically on the q… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison. Side-by-side visualization of detail generation results from JoyAI-Image-Edit [77], Qwen-Image-Edit (PE) [14], FireRed-Image-Edit (PE) [76], Seedream 5.0 Lite [80], and DetailAnywhere (Ours). Our method produces identity-consistent, photorealistic close-ups with faithful texture and structural preservation [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Separation ratio across MMDiT layers. Ratio of the separation score (w/ CFAD / w/o CFAD) for image and text branches. Both branches peak at the alignment target layer 16 (l ∗ ), with the text branch reaching 58.7×. A.8 Additional Visualizations In addition to the main paper visualization, we provide more side-by-side comparisons of generated detail images from different methods on representative FDBench ex… view at source ↗

**Figure 5.** Figure 5: Additional Fashion Detail Visualizations. Additional visualizations. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗

read the original abstract

Diffusion-based generative AI has achieved remarkable success in e-commerce applications such as virtual try-on, poster generation, and product background synthesis. However, when making online purchasing decisions for apparel, consumers also desire the freedom to examine specific detail regions of interest, such as collars, cuffs, and fabric textures, yet existing methods have not explicitly studied this setting. We therefore formalize a new, non-template task: Fashion Detail Generation with focus conditioning, and release FDBench, the first benchmark comprising 40K+ human-verified reference-detail pairs across 41 different categories. This task poses a unique semantic gap challenge: the model must bridge the correspondence between a focus marker on a product reference image and a photorealistic close-up view of the indicated region, while faithfully preserving the garment's identity, without any precise prompt. To bridge this gap, we propose Cross-modal Feature Alignment Distillation (CFAD), which leverages a fine-tuned DINOv3 teacher to align both branches of a Multimodal Diffusion Transformer in a shared semantic space via dual-branch distillation. To further improve consistency between generated details and reference images, we introduce a consistency reward model that jointly scores image pairs along three quality axes and optimizes generation via reinforcement learning. Experiments show that our model DetailAnywhere significantly outperforms all state-of-the-art opensource methods across all metrics and human evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a new task and releases the first benchmark for focus-conditioned fashion detail generation, but the abstract gives no metrics or setup to check the outperformance claim.

read the letter

The paper's main contribution is naming a specific gap: existing diffusion methods for e-commerce apparel do not handle requests for close-up views of marked regions like collars or fabric without a detailed prompt. They formalize this as Fashion Detail Generation and back it with FDBench, a set of 40k+ human-verified reference-detail pairs across 41 categories. That dataset is new and could serve as a starting point for anyone building tools that let users inspect specific garment parts.

The method uses cross-modal feature alignment distillation with a fine-tuned DINOv3 teacher to pull both branches of a multimodal diffusion transformer into one semantic space, plus a reward model that scores pairs on three quality axes and applies reinforcement learning for better consistency. This targets the correspondence problem they describe while trying to preserve garment identity.

The weakness is that the abstract asserts significant gains over all open-source baselines on metrics and human evaluations but includes zero tables, baselines, ablations, or validation procedures. Without those, the claim cannot be assessed, and the assumption that the DINOv3 teacher reliably bridges the marker to the output remains untested in the provided text. The circularity burden is also unclear because no equations or training details appear.

This work is for applied researchers in retail image generation who need a benchmark for detail-focused tasks. A reader looking for a narrow e-commerce application might find the task definition and dataset useful. It deserves peer review because the task is cleanly stated and the benchmark is released, even though the current evidence for the method's effectiveness is missing.

Referee Report

0 major / 1 minor

Summary. The paper formalizes a new task called Fashion Detail Generation with focus conditioning for e-commerce apparel, releases the FDBench benchmark with 40K+ human-verified reference-detail pairs across 41 categories, proposes Cross-modal Feature Alignment Distillation (CFAD) that uses a fine-tuned DINOv3 teacher to align both branches of a Multimodal Diffusion Transformer in a shared semantic space, introduces a consistency reward model scoring image pairs on three axes and optimized via reinforcement learning, and claims that the resulting DetailAnywhere model significantly outperforms all state-of-the-art open-source methods across all metrics and human evaluations.

Significance. If the outperformance claims hold under rigorous validation, the work would be significant for advancing controllable generative models in fashion e-commerce by addressing an unstudied detail-region task without requiring precise prompts, while providing the first dedicated benchmark; the distillation and RL consistency components represent a targeted technical approach to the semantic gap problem.

minor comments (1)

The abstract states outperformance 'across all metrics' but does not name the specific metrics, baselines, or quantitative values, which hinders immediate assessment of the central empirical claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and for acknowledging the task formalization, FDBench benchmark, and the proposed CFAD distillation plus RL consistency components. We note that the report lists no specific major comments, only an overall summary and a conditional significance assessment. We address the recommendation of 'uncertain' by confirming that all experimental claims in the manuscript are supported by the provided metrics, human evaluations, and open-source baselines as described.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided information consists solely of the abstract, which describes the task, benchmark, and proposed method at a high level without any equations, derivations, or specific mathematical steps. No load-bearing claims reduce to self-definitions or fitted inputs by construction, as no such details are available for inspection. The central claim of outperformance is presented as an experimental result rather than a derived prediction from inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5820 in / 1158 out tokens · 30385 ms · 2026-07-03T15:54:53.009832+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

101 extracted references · 29 canonical work pages · 15 internal anchors

[1]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020

2020
[2]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022
[3]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorber, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InInternational Conference on Machine Learning (ICML), 2024

2024
[4]

Viton-hd: High-resolution virtual try-on via misalignment-aware normalization

Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

2021
[5]

Tryondiffusion: A tale of two unets

Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, and Ira Kemelmacher-Shlizerman. Tryondiffusion: A tale of two unets. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023
[6]

Improving diffusion models for authentic virtual try-on in the wild

Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for authentic virtual try-on in the wild. InEuropean Conference on Computer Vision (ECCV), 2024

2024
[7]

Postermaker: Towards high-quality product poster generation with accurate text rendering

Yifan Gao, Zihang Lin, Chuanbin Liu, Min Zhou, Tiezheng Ge, Bo Zheng, and Hongtao Xie. Postermaker: Towards high-quality product poster generation with accurate text rendering. arXiv preprint arXiv:2504.06632, 2025

work page arXiv 2025
[8]

Posteromni: Generalized artistic poster creation via task distillation and unified reward feedback.arXiv preprint arXiv:2602.12127, 2026

Sixiang Chen, Jianyu Lai, Jialin Gao, Hengyu Shi, Zhongying Liu, Tian Ye, Junfeng Luo, Xiaoming Wei, and Lei Zhu. Posteromni: Generalized artistic poster creation via task distillation and unified reward feedback.arXiv preprint arXiv:2602.12127, 2026

work page arXiv 2026
[9]

Multimodal garment designer: Human-centric latent diffusion models for fashion image editing

Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Multimodal garment designer: Human-centric latent diffusion models for fashion image editing. InIEEE/CVF International Conference on Computer Vision (ICCV), 2023

2023
[10]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow im- age editing instructions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023
[11]

Enhancing virtual try-on with synthetic pairs and error-aware noise scheduling

Nannan Li, Kevin J Shih, and Bryan A Plummer. Enhancing virtual try-on with synthetic pairs and error-aware noise scheduling. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21238–21247, 2025

2025
[12]

Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016

2016
[13]

E-comiq-zh: A human-aligned dataset and bench- mark for fine-grained evaluation of e-commerce posters with chain-of-thought

Meiqi Sun, Mingyu Li, and Junxiong Zhu. E-comiq-zh: A human-aligned dataset and bench- mark for fine-grained evaluation of e-commerce posters with chain-of-thought. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026
[14]

Qwen-Image Technical Report

Qwen Team. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

FLUX.2: Frontier visual intelligence

Black Forest Labs. FLUX.2: Frontier visual intelligence. https://bfl.ai/blog/flux-2, 2025

2025
[16]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think

Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, and Baobao Chang. Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6146–6156, 2025

2025
[18]

Aligning text to image in diffusion models is easier than you think.arXiv preprint arXiv:2503.08250, 2025

Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, and Jong Chul Ye. Aligning text to image in diffusion models is easier than you think.arXiv preprint arXiv:2503.08250, 2025

work page arXiv 2025
[19]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

2023
[20]

Representation alignment for generation: Training diffusion transformers is easier than you think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. InInternational Conference on Learning Representations (ICLR), 2025

2025
[21]

Dinov2: Learning robust visual features without supervision.Transactions on Machine Learning Research (TMLR), 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.Transactions on Machine Learning Research (TMLR), 2024

2024
[22]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InIEEE/CVF International Conference on Computer Vision (ICCV), 2021

2021
[23]

Distilling the knowledge in a neural network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2015

2015
[24]

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems (NeurIPS), 2021

Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems (NeurIPS), 2021

2021
[26]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[27]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), 2021

2021
[28]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InIEEE/CVF International Conference on Computer Vision (ICCV), 2023

2023
[29]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.International Conference on Learning Representations (ICLR), 2023

2023
[30]

Edit in 2D, Verify in 3D: Reinforcement Learning for Multi-view Consistent Scene Editing

Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xi- angxiang Chu, Yunchao Wei, Kang Liao, et al. Geometry-guided reinforcement learning for multi-view consistent 3d scene editing.arXiv preprint arXiv:2603.03143, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Learning to generate stylized handwritten text via a unified representation of style, content, and noise

Honglie Wang, Yan-Ming Zhang, Wangzi Yao, Fei Yin, and Cheng-Lin Liu. Learning to generate stylized handwritten text via a unified representation of style, content, and noise. In The Fourteenth International Conference on Learning Representations, 2026

2026
[32]

Template-guided cascaded diffusion for stylized handwritten chinese text-line generation

Honglie Wang, Minsi Ren, Yan-Ming Zhang, Fei Yin, and Cheng-Lin Liu. Template-guided cascaded diffusion for stylized handwritten chinese text-line generation. InInternational Conference on Document Analysis and Recognition, pages 149–166. Springer, 2025

2025
[33]

Sculpting features from noise: Reward-guided hierarchical diffusion for task-optimal feature transformation.Advances in Neural Information Processing Systems, 38:23452–23474, 2026

Nanxu Gong, Zijun Li, Sixun Dong, Haoyue Bai, Wangyang Ying, Xinyuan Wang, and Yanjie Fu. Sculpting features from noise: Reward-guided hierarchical diffusion for task-optimal feature transformation.Advances in Neural Information Processing Systems, 38:23452–23474, 2026. 12

2026
[34]

From editor to dense geometry estimator.arXiv preprint arXiv:2509.04338, 2025

JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, and Yao Zhao. From editor to dense geometry estimator.arXiv preprint arXiv:2509.04338, 2025

work page arXiv 2025
[35]

Symmcompletion: High-fidelity and high-consistency point cloud completion with symmetry guidance

Hongyu Yan, Zijun Li, Kunming Luo, Li Lu, and Ping Tan. Symmcompletion: High-fidelity and high-consistency point cloud completion with symmetry guidance. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9094–9102, 2025

2025
[36]

Diffpc: Diffusion-based high perceptual fidelity image compression with semantic refinement

Yichong Xia, Yimin Zhou, Jinpeng Wang, Baoyi An, Haoqian Wang, Yaowei Wang, and Bin Chen. Diffpc: Diffusion-based high perceptual fidelity image compression with semantic refinement. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[37]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

2023
[38]

Sdedit: Guided image synthesis and editing with stochastic differential equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations (ICLR), 2022

2022
[39]

Diffpcn: Latent diffusion model based on multi-view depth images for point cloud completion.arXiv preprint arXiv:2509.23723, 2025

Zijun Li, Hongyu Yan, Shijie Li, Kunming Luo, Li Lu, Xulei Yang, and Weisi Lin. Diffpcn: Latent diffusion model based on multi-view depth images for point cloud completion.arXiv preprint arXiv:2509.23723, 2025

work page arXiv 2025
[40]

Jasmine: Harnessing diffusion prior for self-supervised depth estimation.arXiv preprint arXiv:2503.15905, 2025

Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, and Yao Zhao. Jasmine: Harnessing diffusion prior for self-supervised depth estimation.arXiv preprint arXiv:2503.15905, 2025

work page arXiv 2025
[41]

Posemaster: Generating 3d characters in arbitrary poses from a single image.arXiv preprint arXiv:2506.21076, 2025

Hongyu Yan, Kunming Luo, Weiyu Li, Yixun Liang, Shengming Li, Jingwei Huang, Chunchao Guo, and Ping Tan. Posemaster: Generating 3d characters in arbitrary poses from a single image.arXiv preprint arXiv:2506.21076, 2025

work page arXiv 2025
[42]

Digging into con- trastive learning for robust depth estimation with diffusion models

Jiyuan Wang, Chunyu Lin, Lang Nie, Kang Liao, Shuwei Shao, and Yao Zhao. Digging into con- trastive learning for robust depth estimation with diffusion models. InProceedings of the 32nd ACM International Conference on Multimedia, MM ’24, page 4129–4137. ACM, October 2024. doi: 10.1145/3664647.3681168. URLhttp://dx.doi.org/10.1145/3664647.3681168

work page doi:10.1145/3664647.3681168 2024
[43]

Weath- erdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions

Jiyuan Wang, Chunyu Lin, Lang Nie, Shujun Huang, Yao Zhao, Xing Pan, and Rui Ai. Weath- erdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions. In2024 IEEE International Conference on Robotics and Automation (ICRA), page 4976–4982. IEEE, May 2024. doi: 10.1109/icra57147.2024.10611100. URL http://dx.d...

work page doi:10.1109/icra57147.2024.10611100 2024
[44]

Denoising diffusion autoencoders are unified self-supervised learners

Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang. Denoising diffusion autoencoders are unified self-supervised learners. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

2023
[45]

Emer- gent correspondence from image diffusion

Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emer- gent correspondence from image diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[46]

Cross modal distillation for supervision transfer

Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016

2016
[47]

Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on

Jeongho Kim, Gyojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[48]

Catvton: Concatenation is all you need for virtual try-on with diffusion models

Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Dongmei Jiang, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models. InInternational Conference on Learning Representations (ICLR), 2025. 13

2025
[49]

Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on

Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. InACM Multimedia, 2023

2023
[50]

Texture- preserving diffusion models for high-fidelity virtual try-on

Xu Yang, Changxing Ding, Zhibin Hong, Junhao Huang, Jin Tao, and Xiangmin Xu. Texture- preserving diffusion models for high-fidelity virtual try-on. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[51]

Training diffusion models with reinforcement learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2024

2024
[52]

Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[53]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems (NeurIPS), 2023

2023
[55]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Editre- ward: A human-aligned reward model for instruction-guided image editing.arXiv preprint arXiv:2509.26346, 2025

Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, and Wenhu Chen. Editre- ward: A human-aligned reward model for instruction-guided image editing.arXiv preprint arXiv:2509.26346, 2025

work page arXiv 2025
[57]

Editscore: Unlocking online RL for image editing via high-fidelity reward modeling.arXiv preprint arXiv:2509.23909, 2025

Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, and Zheng Liu. Editscore: Unlocking online RL for image editing via high-fidelity reward modeling.arXiv preprint arXiv:2509.23909, 2025

work page arXiv 2025
[58]

Deepfm: A factorization-machine based neural network for ctr prediction

Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: A factorization-machine based neural network for ctr prediction. InInternational Joint Conference on Artificial Intelligence (IJCAI), 2017

2017
[59]

Autopp: Towards automated product poster generation and optimization, 2025

Jiahao Fan, Yuxin Qin, Wei Feng, Yanyin Chen, Yaoyu Li, Ao Ma, Yixiu Li, Li Zhuang, Haoyi Bian, Zheng Zhang, Jingjing Lv, Junjie Shen, and Ching Law. Autopp: Towards automated product poster generation and optimization, 2025. URL https://arxiv.org/abs/2512. 21921

2025
[60]

Vlm-r 3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.Advances in Neural Information Processing Systems, 38:63841– 63869, 2025

Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. Vlm-r 3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.Advances in Neural Information Processing Systems, 38:63841– 63869, 2025

2025
[61]

Arc: Robots adaptive risk-aware robust control via distributional reinforcement learning

Junlong Wu, Yi Cheng, Hang Liu, and Houde Liu. Arc: Robots adaptive risk-aware robust control via distributional reinforcement learning. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10656–10663. IEEE, 2025

2025
[62]

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Jiyuan Wang, Huan Ouyang, Jiuzhou Lin, Chunyu Lin, Dewen Fan, Boheng Zhang, Haonan Fan, Fei Zuo, Jia Sun, Huaiqing Wang, et al. Cac: Advancing video reward models via hierarchical spatiotemporal concentrating.arXiv preprint arXiv:2605.11723, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[63]

High quality underwater image compression with adaptive color correction

Yimin Zhou, Yichong Xia, Sicheng Pan, Bin Chen, Yaowei Li, Jiawei Li, Mingyao Hong, Zhi Wang, and Yaowei Wang. High quality underwater image compression with adaptive color correction. InICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8587–8591. IEEE, 2026. 14

2026
[64]

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

Yongrui Heng, Chaoya Jiang, Han Yang, Shikun Zhang, and Wei Ye. Eve: Verifiable self- evolution of mllms via executable visual transformations.arXiv preprint arXiv:2604.18320, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[65]

FLUX.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. FLUX.https://github.com/black-forest-labs/flux, 2024

2024
[66]

Text embedding is not all you need: Attention control for text-to-image semantic alignment with text self-attention maps.arXiv preprint arXiv:2411.15236, 2024

Jeeyung Kim, Erfan Esmaeili, and Qiang Qiu. Text embedding is not all you need: Attention control for text-to-image semantic alignment with text self-attention maps.arXiv preprint arXiv:2411.15236, 2024

work page arXiv 2024
[67]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

2025
[68]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[69]

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[70]

HPSv3: Towards wide-spectrum human preference score.arXiv preprint arXiv:2508.03789, 2025

Yuhang Ma et al. HPSv3: Towards wide-spectrum human preference score.arXiv preprint arXiv:2508.03789, 2025

work page arXiv 2025
[71]

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[72]

The unreason- able effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreason- able effectiveness of deep features as a perceptual metric. InCVPR, 2018

2018
[73]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, et al. Qwen3- vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[74]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, et al. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[75]

In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer

Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. In NeurIPS, 2025

2025
[76]

FireRed-Image-Edit-1.0 technical report.arXiv preprint arXiv:2602.13344, 2026

Super Intelligence Team. FireRed-Image-Edit-1.0 technical report.arXiv preprint arXiv:2602.13344, 2026

work page arXiv 2026
[77]

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, et al. Awaking spatial intelligence in unified multimodal understanding and generation.arXiv preprint arXiv:2605.04128, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[78]

GPT-Image-1: Image generation in the OpenAI API

OpenAI. GPT-Image-1: Image generation in the OpenAI API. https://openai.com/index/ image-generation-api/, 2025

2025
[79]

Nano Banana 2 (Gemini 3.1 Flash Image Preview)

Google DeepMind. Nano Banana 2 (Gemini 3.1 Flash Image Preview). https: //ai.google.dev/gemini-api/docs/image-generation, 2026. Model ID: gemini-3.1-flash-image-preview; accessed 2026-05-07

2026
[80]

Seedream 5.0 Lite

ByteDance Seed. Seedream 5.0 Lite. https://seed.bytedance.com/en/seedream5_0_ lite, 2026

2026

Showing first 80 references.

[1] [1]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020

2020

[2] [2]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022

[3] [3]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorber, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InInternational Conference on Machine Learning (ICML), 2024

2024

[4] [4]

Viton-hd: High-resolution virtual try-on via misalignment-aware normalization

Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

2021

[5] [5]

Tryondiffusion: A tale of two unets

Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, and Ira Kemelmacher-Shlizerman. Tryondiffusion: A tale of two unets. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023

[6] [6]

Improving diffusion models for authentic virtual try-on in the wild

Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving diffusion models for authentic virtual try-on in the wild. InEuropean Conference on Computer Vision (ECCV), 2024

2024

[7] [7]

Postermaker: Towards high-quality product poster generation with accurate text rendering

Yifan Gao, Zihang Lin, Chuanbin Liu, Min Zhou, Tiezheng Ge, Bo Zheng, and Hongtao Xie. Postermaker: Towards high-quality product poster generation with accurate text rendering. arXiv preprint arXiv:2504.06632, 2025

work page arXiv 2025

[8] [8]

Posteromni: Generalized artistic poster creation via task distillation and unified reward feedback.arXiv preprint arXiv:2602.12127, 2026

Sixiang Chen, Jianyu Lai, Jialin Gao, Hengyu Shi, Zhongying Liu, Tian Ye, Junfeng Luo, Xiaoming Wei, and Lei Zhu. Posteromni: Generalized artistic poster creation via task distillation and unified reward feedback.arXiv preprint arXiv:2602.12127, 2026

work page arXiv 2026

[9] [9]

Multimodal garment designer: Human-centric latent diffusion models for fashion image editing

Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Multimodal garment designer: Human-centric latent diffusion models for fashion image editing. InIEEE/CVF International Conference on Computer Vision (ICCV), 2023

2023

[10] [10]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow im- age editing instructions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023

[11] [11]

Enhancing virtual try-on with synthetic pairs and error-aware noise scheduling

Nannan Li, Kevin J Shih, and Bryan A Plummer. Enhancing virtual try-on with synthetic pairs and error-aware noise scheduling. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21238–21247, 2025

2025

[12] [12]

Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016

2016

[13] [13]

E-comiq-zh: A human-aligned dataset and bench- mark for fine-grained evaluation of e-commerce posters with chain-of-thought

Meiqi Sun, Mingyu Li, and Junxiong Zhu. E-comiq-zh: A human-aligned dataset and bench- mark for fine-grained evaluation of e-commerce posters with chain-of-thought. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026

[14] [14]

Qwen-Image Technical Report

Qwen Team. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

FLUX.2: Frontier visual intelligence

Black Forest Labs. FLUX.2: Frontier visual intelligence. https://bfl.ai/blog/flux-2, 2025

2025

[16] [16]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think

Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, and Baobao Chang. Multimodal representation alignment for image generation: Text-image interleaved control is easier than you think. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6146–6156, 2025

2025

[18] [18]

Aligning text to image in diffusion models is easier than you think.arXiv preprint arXiv:2503.08250, 2025

Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, and Jong Chul Ye. Aligning text to image in diffusion models is easier than you think.arXiv preprint arXiv:2503.08250, 2025

work page arXiv 2025

[19] [19]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

2023

[20] [20]

Representation alignment for generation: Training diffusion transformers is easier than you think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. InInternational Conference on Learning Representations (ICLR), 2025

2025

[21] [21]

Dinov2: Learning robust visual features without supervision.Transactions on Machine Learning Research (TMLR), 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.Transactions on Machine Learning Research (TMLR), 2024

2024

[22] [22]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InIEEE/CVF International Conference on Computer Vision (ICCV), 2021

2021

[23] [23]

Distilling the knowledge in a neural network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2015

2015

[24] [24]

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems (NeurIPS), 2021

Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems (NeurIPS), 2021

2021

[26] [26]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[27] [27]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), 2021

2021

[28] [28]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InIEEE/CVF International Conference on Computer Vision (ICCV), 2023

2023

[29] [29]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.International Conference on Learning Representations (ICLR), 2023

2023

[30] [30]

Edit in 2D, Verify in 3D: Reinforcement Learning for Multi-view Consistent Scene Editing

Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xi- angxiang Chu, Yunchao Wei, Kang Liao, et al. Geometry-guided reinforcement learning for multi-view consistent 3d scene editing.arXiv preprint arXiv:2603.03143, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[31] [31]

Learning to generate stylized handwritten text via a unified representation of style, content, and noise

Honglie Wang, Yan-Ming Zhang, Wangzi Yao, Fei Yin, and Cheng-Lin Liu. Learning to generate stylized handwritten text via a unified representation of style, content, and noise. In The Fourteenth International Conference on Learning Representations, 2026

2026

[32] [32]

Template-guided cascaded diffusion for stylized handwritten chinese text-line generation

Honglie Wang, Minsi Ren, Yan-Ming Zhang, Fei Yin, and Cheng-Lin Liu. Template-guided cascaded diffusion for stylized handwritten chinese text-line generation. InInternational Conference on Document Analysis and Recognition, pages 149–166. Springer, 2025

2025

[33] [33]

Sculpting features from noise: Reward-guided hierarchical diffusion for task-optimal feature transformation.Advances in Neural Information Processing Systems, 38:23452–23474, 2026

Nanxu Gong, Zijun Li, Sixun Dong, Haoyue Bai, Wangyang Ying, Xinyuan Wang, and Yanjie Fu. Sculpting features from noise: Reward-guided hierarchical diffusion for task-optimal feature transformation.Advances in Neural Information Processing Systems, 38:23452–23474, 2026. 12

2026

[34] [34]

From editor to dense geometry estimator.arXiv preprint arXiv:2509.04338, 2025

JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, and Yao Zhao. From editor to dense geometry estimator.arXiv preprint arXiv:2509.04338, 2025

work page arXiv 2025

[35] [35]

Symmcompletion: High-fidelity and high-consistency point cloud completion with symmetry guidance

Hongyu Yan, Zijun Li, Kunming Luo, Li Lu, and Ping Tan. Symmcompletion: High-fidelity and high-consistency point cloud completion with symmetry guidance. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9094–9102, 2025

2025

[36] [36]

Diffpc: Diffusion-based high perceptual fidelity image compression with semantic refinement

Yichong Xia, Yimin Zhou, Jinpeng Wang, Baoyi An, Haoqian Wang, Yaowei Wang, and Bin Chen. Diffpc: Diffusion-based high perceptual fidelity image compression with semantic refinement. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[37] [37]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

2023

[38] [38]

Sdedit: Guided image synthesis and editing with stochastic differential equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations (ICLR), 2022

2022

[39] [39]

Diffpcn: Latent diffusion model based on multi-view depth images for point cloud completion.arXiv preprint arXiv:2509.23723, 2025

Zijun Li, Hongyu Yan, Shijie Li, Kunming Luo, Li Lu, Xulei Yang, and Weisi Lin. Diffpcn: Latent diffusion model based on multi-view depth images for point cloud completion.arXiv preprint arXiv:2509.23723, 2025

work page arXiv 2025

[40] [40]

Jasmine: Harnessing diffusion prior for self-supervised depth estimation.arXiv preprint arXiv:2503.15905, 2025

Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, and Yao Zhao. Jasmine: Harnessing diffusion prior for self-supervised depth estimation.arXiv preprint arXiv:2503.15905, 2025

work page arXiv 2025

[41] [41]

Posemaster: Generating 3d characters in arbitrary poses from a single image.arXiv preprint arXiv:2506.21076, 2025

Hongyu Yan, Kunming Luo, Weiyu Li, Yixun Liang, Shengming Li, Jingwei Huang, Chunchao Guo, and Ping Tan. Posemaster: Generating 3d characters in arbitrary poses from a single image.arXiv preprint arXiv:2506.21076, 2025

work page arXiv 2025

[42] [42]

Digging into con- trastive learning for robust depth estimation with diffusion models

Jiyuan Wang, Chunyu Lin, Lang Nie, Kang Liao, Shuwei Shao, and Yao Zhao. Digging into con- trastive learning for robust depth estimation with diffusion models. InProceedings of the 32nd ACM International Conference on Multimedia, MM ’24, page 4129–4137. ACM, October 2024. doi: 10.1145/3664647.3681168. URLhttp://dx.doi.org/10.1145/3664647.3681168

work page doi:10.1145/3664647.3681168 2024

[43] [43]

Weath- erdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions

Jiyuan Wang, Chunyu Lin, Lang Nie, Shujun Huang, Yao Zhao, Xing Pan, and Rui Ai. Weath- erdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions. In2024 IEEE International Conference on Robotics and Automation (ICRA), page 4976–4982. IEEE, May 2024. doi: 10.1109/icra57147.2024.10611100. URL http://dx.d...

work page doi:10.1109/icra57147.2024.10611100 2024

[44] [44]

Denoising diffusion autoencoders are unified self-supervised learners

Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang. Denoising diffusion autoencoders are unified self-supervised learners. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

2023

[45] [45]

Emer- gent correspondence from image diffusion

Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emer- gent correspondence from image diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[46] [46]

Cross modal distillation for supervision transfer

Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016

2016

[47] [47]

Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on

Jeongho Kim, Gyojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[48] [48]

Catvton: Concatenation is all you need for virtual try-on with diffusion models

Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Dongmei Jiang, and Xiaodan Liang. Catvton: Concatenation is all you need for virtual try-on with diffusion models. InInternational Conference on Learning Representations (ICLR), 2025. 13

2025

[49] [49]

Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on

Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. Ladi-vton: Latent diffusion textual-inversion enhanced virtual try-on. InACM Multimedia, 2023

2023

[50] [50]

Texture- preserving diffusion models for high-fidelity virtual try-on

Xu Yang, Changxing Ding, Zhibin Hong, Junhao Huang, Jin Tao, and Xiangmin Xu. Texture- preserving diffusion models for high-fidelity virtual try-on. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[51] [51]

Training diffusion models with reinforcement learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2024

2024

[52] [52]

Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[53] [53]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [54]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[55] [55]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[56] [56]

Editre- ward: A human-aligned reward model for instruction-guided image editing.arXiv preprint arXiv:2509.26346, 2025

Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, and Wenhu Chen. Editre- ward: A human-aligned reward model for instruction-guided image editing.arXiv preprint arXiv:2509.26346, 2025

work page arXiv 2025

[57] [57]

Editscore: Unlocking online RL for image editing via high-fidelity reward modeling.arXiv preprint arXiv:2509.23909, 2025

Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, and Zheng Liu. Editscore: Unlocking online RL for image editing via high-fidelity reward modeling.arXiv preprint arXiv:2509.23909, 2025

work page arXiv 2025

[58] [58]

Deepfm: A factorization-machine based neural network for ctr prediction

Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: A factorization-machine based neural network for ctr prediction. InInternational Joint Conference on Artificial Intelligence (IJCAI), 2017

2017

[59] [59]

Autopp: Towards automated product poster generation and optimization, 2025

Jiahao Fan, Yuxin Qin, Wei Feng, Yanyin Chen, Yaoyu Li, Ao Ma, Yixiu Li, Li Zhuang, Haoyi Bian, Zheng Zhang, Jingjing Lv, Junjie Shen, and Ching Law. Autopp: Towards automated product poster generation and optimization, 2025. URL https://arxiv.org/abs/2512. 21921

2025

[60] [60]

Vlm-r 3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.Advances in Neural Information Processing Systems, 38:63841– 63869, 2025

Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. Vlm-r 3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.Advances in Neural Information Processing Systems, 38:63841– 63869, 2025

2025

[61] [61]

Arc: Robots adaptive risk-aware robust control via distributional reinforcement learning

Junlong Wu, Yi Cheng, Hang Liu, and Houde Liu. Arc: Robots adaptive risk-aware robust control via distributional reinforcement learning. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10656–10663. IEEE, 2025

2025

[62] [62]

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Jiyuan Wang, Huan Ouyang, Jiuzhou Lin, Chunyu Lin, Dewen Fan, Boheng Zhang, Haonan Fan, Fei Zuo, Jia Sun, Huaiqing Wang, et al. Cac: Advancing video reward models via hierarchical spatiotemporal concentrating.arXiv preprint arXiv:2605.11723, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[63] [63]

High quality underwater image compression with adaptive color correction

Yimin Zhou, Yichong Xia, Sicheng Pan, Bin Chen, Yaowei Li, Jiawei Li, Mingyao Hong, Zhi Wang, and Yaowei Wang. High quality underwater image compression with adaptive color correction. InICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8587–8591. IEEE, 2026. 14

2026

[64] [64]

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

Yongrui Heng, Chaoya Jiang, Han Yang, Shikun Zhang, and Wei Ye. Eve: Verifiable self- evolution of mllms via executable visual transformations.arXiv preprint arXiv:2604.18320, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[65] [65]

FLUX.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. FLUX.https://github.com/black-forest-labs/flux, 2024

2024

[66] [66]

Text embedding is not all you need: Attention control for text-to-image semantic alignment with text self-attention maps.arXiv preprint arXiv:2411.15236, 2024

Jeeyung Kim, Erfan Esmaeili, and Qiang Qiu. Text embedding is not all you need: Attention control for text-to-image semantic alignment with text self-attention maps.arXiv preprint arXiv:2411.15236, 2024

work page arXiv 2024

[67] [67]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

2025

[68] [68]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[69] [69]

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[70] [70]

HPSv3: Towards wide-spectrum human preference score.arXiv preprint arXiv:2508.03789, 2025

Yuhang Ma et al. HPSv3: Towards wide-spectrum human preference score.arXiv preprint arXiv:2508.03789, 2025

work page arXiv 2025

[71] [71]

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[72] [72]

The unreason- able effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreason- able effectiveness of deep features as a perceptual metric. InCVPR, 2018

2018

[73] [73]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, et al. Qwen3- vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[74] [74]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, et al. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[75] [75]

In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer

Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. In NeurIPS, 2025

2025

[76] [76]

FireRed-Image-Edit-1.0 technical report.arXiv preprint arXiv:2602.13344, 2026

Super Intelligence Team. FireRed-Image-Edit-1.0 technical report.arXiv preprint arXiv:2602.13344, 2026

work page arXiv 2026

[77] [77]

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, et al. Awaking spatial intelligence in unified multimodal understanding and generation.arXiv preprint arXiv:2605.04128, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[78] [78]

GPT-Image-1: Image generation in the OpenAI API

OpenAI. GPT-Image-1: Image generation in the OpenAI API. https://openai.com/index/ image-generation-api/, 2025

2025

[79] [79]

Nano Banana 2 (Gemini 3.1 Flash Image Preview)

Google DeepMind. Nano Banana 2 (Gemini 3.1 Flash Image Preview). https: //ai.google.dev/gemini-api/docs/image-generation, 2026. Model ID: gemini-3.1-flash-image-preview; accessed 2026-05-07

2026

[80] [80]

Seedream 5.0 Lite

ByteDance Seed. Seedream 5.0 Lite. https://seed.bytedance.com/en/seedream5_0_ lite, 2026

2026