MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

Chubin Chen; Henglin Liu; Jie Guo; Kaer Huang; Nisha Huang; Tong-Yee Lee; Xiu Li; Yizhou Lin

arxiv: 2605.15660 · v1 · pith:3I2BXB52new · submitted 2026-05-15 · 💻 cs.CV

Nisha Huang, Henglin Liu, Yizhou Lin, Kaer Huang, Chubin Chen, Jie Guo, Tong-Yee Lee, Xiu Li

Pith reviewed 2026-05-20 18:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords material transfer · diffusion transformer · zero-shot · image integration · multi-modal attention · training-free

The pith

MaTe enables high-quality material transfer using only images in a diffusion transformer without text or additional networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion-based material transfer methods typically depend on text prompts, fine-tuning, or extra networks such as ControlNet, which introduce feature misalignment and extra computational costs. MaTe proposes a streamlined approach that integrates the input images directly at the token level. This allows the diffusion transformer to process them together through multi-modal attention within a shared latent space. The result is a zero-shot, training-free system that generates materials while keeping precise detail alignment. This significantly reduces the prerequisites for inference compared to prior techniques.

Core claim

By integrating input images at the token level and processing them via multi-modal attention in a shared latent space, MaTe removes the need for textual guidance, reference networks, adapters, ControlNet, inversion sampling, or model fine-tuning, enabling high-quality material generation in a zero-shot, training-free paradigm that outperforms state-of-the-art methods in visual quality and efficiency.

What carries the argument

Token-level image integration via multi-modal attention in a shared latent space of the diffusion transformer.
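The idea of token-level integration can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes three token streams (the names `material`, `structure`, and `noisy` are hypothetical), uses identity projections instead of learned query/key/value weights, and runs a single attention head in pure Python. It only shows that once the streams are concatenated into one sequence, every token can attend to every other token in a single attention pass.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(tokens):
    """Single-head self-attention over one concatenated token sequence.

    Queries, keys, and values are the tokens themselves (identity
    projections), which is a simplification for illustration only.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


def make_tokens(count, dim, rng):
    """Random stand-ins for encoded image patches."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8
material = make_tokens(4, DIM, rng)   # hypothetical material-image tokens
structure = make_tokens(4, DIM, rng)  # hypothetical structure tokens (assumption)
noisy = make_tokens(4, DIM, rng)      # hypothetical noisy latent tokens (assumption)

# Token-level integration: one sequence, one attention pass, no side network.
sequence = material + structure + noisy
outputs, weights = joint_attention(sequence)
print(len(sequence), "tokens attended jointly")
```

Each row of `weights` is a probability distribution over all twelve tokens, so the rows belonging to the noisy latent tokens mix information from both conditioning streams without a separate adapter or reference branch. Whether this mixing is enough to keep fine detail aligned is exactly what the review below questions.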

If this is right

Material transfer can be performed without any textual input or prompt engineering.
No additional training or fine-tuning of the model is required for new material transfers.
Feature misalignment issues common in prior methods are avoided through unified processing.
The computational cost is lower due to the absence of reference networks and adapters.
Precise detail alignment between source and target is preserved automatically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This token-level approach could extend to other image manipulation tasks where alignment is critical.
Simplifying the pipeline might allow integration into real-time graphics or design tools.
The success without text suggests that visual information alone suffices for certain generation tasks in diffusion models.

Load-bearing premise

Integrating input images at the token level and processing them via multi-modal attention in a shared latent space is sufficient to eliminate feature misalignment and remove the need for textual guidance or extra networks.

What would settle it

The claim would be disproved by cases in which material transfer yields misaligned features or poor quality when it relies solely on token-level image integration, with no textual or network assistance.

Figures

Figures reproduced from arXiv: 2605.15660 by Chubin Chen, Henglin Liu, Jie Guo, Kaer Huang, Nisha Huang, Tong-Yee Lee, Xiu Li, Yizhou Lin.

**Figure 1.** MaTe is a material transfer method that enables the transformation of textures from a single real-world image without any prior knowledge. This approach is not only capable of successfully extracting texture information from antiques with thousands of years of history but also handles popular computer graphics images, jewelry, and fur materials, providing strong support for design work.

**Figure 2.** Simplified structure comparison of different kinds of …

**Figure 3.** Our method achieves high-quality material transfer by simply passing three types of image tokens (material image tokens …

**Figure 4.** Qualitative comparison on the MTB dataset. MaTe demonstrates a distinct advantage in handling complex materials. (d)-(g) are …

**Figure 5.** Material image effect intensity ablation experiment.

**Figure 6.** Ablation experiment on the depth control parameter …

**Figure 7.** Regarding the ablation experiment results when …

Original abstract

Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MaTe pushes a minimal diffusion transformer for zero-shot material transfer through token-level image integration and multi-modal attention, but the abstract gives no metrics or ablations to check whether the alignment actually works.

The main thing here is MaTe's claim that a diffusion transformer alone can handle material transfer: the source and target images are fed in as tokens, and multi-modal attention relates them in a shared latent space. There are no text prompts, reference networks, adapters, ControlNet, inversion, or fine-tuning. That is the central design choice, and the authors position it as the fix for the misalignment and extra cost of earlier diffusion methods for this task.

Referee Report

2 major / 1 minor

Summary. The paper proposes MaTe, a streamlined diffusion transformer framework for material transfer. It integrates source and target images directly at the token level for unified processing via multi-modal attention within a shared latent space, eliminating textual guidance, reference networks, adapters, ControlNet, inversion sampling, and model fine-tuning. The central claim is that this yields high-quality, zero-shot, training-free material generation that outperforms prior state-of-the-art methods in visual quality and efficiency while preserving precise detail alignment.

Significance. If the architectural claims hold with rigorous validation, the work would represent a meaningful simplification of diffusion-based material transfer pipelines, reducing reliance on auxiliary conditioning mechanisms and lowering inference costs in computer vision applications such as image editing and synthesis.

major comments (2)

[Abstract] Abstract: the claim of 'extensive experiments' demonstrating outperformance in visual quality, efficiency, and detail alignment supplies no quantitative metrics, baselines, ablation details, dataset descriptions, or evaluation protocols. This absence is load-bearing for the central claim, as the support for superiority under a zero-shot paradigm cannot be assessed from the provided information.
[Method] Method section (token-level integration and multi-modal attention description): the manuscript does not supply a concrete analysis, derivation, or empirical test showing how shared-space multi-modal attention alone enforces cross-image correspondences for material properties (reflectance, texture, lighting) without semantic drift or detail loss. Prior diffusion material methods required explicit mechanisms precisely to address this; the assumption that attention suffices remains unverified and is central to the zero-shot training-free claim.

minor comments (1)

[Abstract] Abstract: the phrase 'precise detail alignment' is used without operational definition or reference to specific metrics (e.g., edge preservation, texture fidelity) that would allow readers to interpret the claim.
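One way to make "detail alignment" operational is a structural similarity score between a reference image and the generated result. The sketch below is a simplified illustration, not the paper's evaluation protocol: it computes SSIM once over the whole image using population statistics, whereas the standard metric averages the same formula over local, Gaussian-weighted windows. The function name `global_ssim` and the toy pixel values are hypothetical.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM for two equally sized grayscale images.

    Images are flat lists of pixel intensities. This is a simplification:
    the standard metric averages this formula over local windows.
    """
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants from the usual SSIM definition (K1=0.01, K2=0.03).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy 3x3 "images" flattened to lists (values are illustrative only).
reference = [10, 50, 90, 130, 170, 210, 250, 30, 70]
identical = list(reference)
inverted = [255 - v for v in reference]

print("identical:", global_ssim(reference, identical))
print("inverted: ", global_ssim(reference, inverted))
```

A score close to 1 indicates that the structure of the reference is preserved, and a contrast-inverted copy scores far lower. Reporting such a score per test pair, alongside the metric definition, would let readers interpret the alignment claim.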

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions planned for the manuscript to strengthen the presentation of our claims.

Referee: [Abstract] Abstract: the claim of 'extensive experiments' demonstrating outperformance in visual quality, efficiency, and detail alignment supplies no quantitative metrics, baselines, ablation details, dataset descriptions, or evaluation protocols. This absence is load-bearing for the central claim, as the support for superiority under a zero-shot paradigm cannot be assessed from the provided information.

Authors: We agree that the abstract, being a high-level summary, would be strengthened by explicit references to the quantitative support. The experiments section of the manuscript reports comparisons against prior methods using standard metrics for visual quality (e.g., LPIPS, FID), efficiency (inference time and memory), and detail alignment (e.g., structural similarity measures), along with dataset details and protocols. We will revise the abstract to concisely reference these elements and key numerical improvements, ensuring the central claims are better supported without altering the zero-shot emphasis.

revision: yes

Referee: [Method] Method section (token-level integration and multi-modal attention description): the manuscript does not supply a concrete analysis, derivation, or empirical test showing how shared-space multi-modal attention alone enforces cross-image correspondences for material properties (reflectance, texture, lighting) without semantic drift or detail loss. Prior diffusion material methods required explicit mechanisms precisely to address this; the assumption that attention suffices remains unverified and is central to the zero-shot training-free claim.

Authors: We acknowledge the value of a more explicit verification for this core design choice. In the revised manuscript we will expand the method section with a short analysis of how token-level integration in the shared latent space enables the multi-modal attention layers to align material attributes across source and target images. We will also add empirical support in the form of attention-map visualizations and targeted ablations that isolate the contribution of the shared-space attention to correspondence preservation. These additions will directly address the concern relative to prior explicit conditioning approaches.

revision: yes
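An attention-map analysis of the kind promised here could start from a simple summary statistic: how much attention mass the generated-image tokens place on each conditioning stream. The sketch below is a hypothetical illustration on a hand-written attention matrix. The token layout (two material tokens, two structure tokens, two latent tokens) and the function name `attention_share` are assumptions, not details taken from the paper.

```python
def attention_share(weights, query_rows, key_cols):
    """Mean attention mass that the chosen query rows place on the chosen keys.

    `weights` is a row-stochastic attention matrix (each row sums to 1).
    """
    total = 0.0
    for row in query_rows:
        total += sum(weights[row][col] for col in key_cols)
    return total / len(query_rows)


# Hypothetical layout: indices 0-1 material, 2-3 structure, 4-5 latent.
MATERIAL, STRUCTURE, LATENT = [0, 1], [2, 3], [4, 5]
uniform = [1 / 6] * 6

# Toy attention matrix; only the latent rows (4 and 5) matter below.
weights = [
    uniform,
    uniform,
    uniform,
    uniform,
    [0.3, 0.3, 0.1, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2, 0.2, 0.1, 0.1],
]

shares = {
    "material": attention_share(weights, LATENT, MATERIAL),
    "structure": attention_share(weights, LATENT, STRUCTURE),
    "latent": attention_share(weights, LATENT, LATENT),
}
for name, value in shares.items():
    print(f"{name}: {value:.2f}")
```

Tracking these shares per layer and per denoising step, and checking how they change when one stream is ablated, is one plausible way to test whether shared-space attention alone carries the cross-image correspondence.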

Circularity Check

0 steps flagged

Architectural design choice presented without reduction to inputs or self-citations

The paper proposes MaTe as a streamlined diffusion framework that integrates source and target images directly at the token level for unified multi-modal attention processing in a shared latent space. This is explicitly positioned as an architectural choice that removes textual guidance, reference networks, adapters, ControlNet, inversion sampling, and fine-tuning to achieve zero-shot training-free material transfer. No equations, derivations, or load-bearing steps in the abstract or described method reduce the central claims to fitted parameters renamed as predictions, self-definitional loops, or chains of self-citations whose validity depends on the present work. The contribution is supported by experimental comparisons rather than internal redefinitions, rendering the framework self-contained as an empirical simplification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities; the contribution is framed as a new architectural integration within existing diffusion transformer components.

pith-pipeline@v0.9.0 · 5667 in / 1102 out tokens · 69985 ms · 2026-05-20T18:35:36.171866+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 2 internal anchors

[1]

Text2tex: Text-driven tex- ture synthesis via diffusion models

Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. Text2tex: Text-driven tex- ture synthesis via diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18558–18568, 2023. 3

work page 2023
[2]

Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis

Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis. InThe Twelfth International Conference on Learning Repre- sentations. 3

work page
[3]

Zest: Zero-shot material trans- fer from a single image

Ta-Ying Cheng, Prafull Sharma, Andrew Markham, Niki Trigoni, and Varun Jampani. Zest: Zero-shot material trans- fer from a single image. InEuropean Conference on Com- puter Vision, pages 370–386. Springer, 2024. 2, 3, 4, 6, 7

work page 2024
[4]

Catv- ton: Concatenation is all you need for virtual try-on with dif- fusion models

Zheng Chong, Xiao Dong, Haoxiang Li, Wenqing Zhang, Hanqing Zhao, Dongmei Jiang, Xiaodan Liang, et al. Catv- ton: Concatenation is all you need for virtual try-on with dif- fusion models. InThe Thirteenth International Conference on Learning Representations. 2

work page
[5]

Single-image svbrdf cap- ture with a rendering-aware deep network.ACM Transac- tions on Graphics (ToG), 37(4):1–15, 2018

Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. Single-image svbrdf cap- ture with a rendering-aware deep network.ACM Transac- tions on Graphics (ToG), 37(4):1–15, 2018. 1

work page 2018
[6]

controlnet-depth-sdxl-1.0.https : / / huggingface

Diffusers. controlnet-depth-sdxl-1.0.https : / / huggingface . co / diffusers / controlnet - depth-sdxl-1.0. 6

work page
[7]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InIn- ternational Conference on Machine Learning, pages 12606– 12633. PMLR, 2024. 2, 3

work page 2024
[8]

An image is worth one word: Personalizing text-to-image gener- ation using textual inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-or. An image is worth one word: Personalizing text-to-image gener- ation using textual inversion. InThe Eleventh International Conference on Learning Representations. 2

work page
[9]

Materialfusion: High-quality, zero-shot, and controllable material transfer with diffusion models.arXiv preprint arXiv:2502.06606, 2025

Kamil Garifullin, Maxim Nikolaev, Andrey Kuznetsov, and Aibek Alanov. Materialfusion: High-quality, zero-shot, and controllable material transfer with diffusion models.arXiv preprint arXiv:2502.06606, 2025. 2, 3, 4, 6, 7

work page arXiv 2025
[10]

Generative modelling of brdf textures from flash images.ACM Transactions on Graphics (ToG), 40(6): 1–13, 2021

Philipp Henzler, Valentin Deschaintre, Niloy J Mitra, and To- bias Ritschel. Generative modelling of brdf textures from flash images.ACM Transactions on Graphics (ToG), 40(6): 1–13, 2021. 1

work page 2021
[11]

Prompt-to-prompt image editing with cross-attention control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image editing with cross-attention control. InThe Eleventh Inter- national Conference on Learning Representations. 2

work page
[12]

Lora: Low-rank adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InThe Tenth Interna- tional Conference on Learning Representations. 2, 4

work page
[13]

A novel framework for inverse procedural texture modeling.ACM Transactions on Graphics (ToG), 38(6):1–14, 2019

Yiwei Hu, Julie Dorsey, and Holly Rushmeier. A novel framework for inverse procedural texture modeling.ACM Transactions on Graphics (ToG), 38(6):1–14, 2019. 1

work page 2019
[14]

An inverse procedural modeling pipeline for svbrdf maps.ACM Transactions on Graphics (ToG), 41(2):1–17, 2022

Yiwei Hu, Chengan He, Valentin Deschaintre, Julie Dorsey, and Holly Rushmeier. An inverse procedural modeling pipeline for svbrdf maps.ACM Transactions on Graphics (ToG), 41(2):1–17, 2022. 1

work page 2022
[15]

Draw your art dream: Diverse digital art synthesis with multimodal guided diffusion

Nisha Huang, Fan Tang, Weiming Dong, and Changsheng Xu. Draw your art dream: Diverse digital art synthesis with multimodal guided diffusion. InProceedings of the 30th ACM International Conference on Multimedia, pages 1085– 1094, 2022. 2

work page 2022
[16]

Diff- styler: Controllable dual diffusion for text-driven image styl- ization.IEEE Transactions on Neural Networks and Learn- ing Systems, 2024

Nisha Huang, Yuxin Zhang, Fan Tang, Chongyang Ma, Haibin Huang, Weiming Dong, and Changsheng Xu. Diff- styler: Controllable dual diffusion for text-driven image styl- ization.IEEE Transactions on Neural Networks and Learn- ing Systems, 2024. 2

work page 2024
[17]

Creativesynth: Cross-art-attention for artis- tic image synthesis with multimodal diffusion.IEEE Trans- actions on Visualization and Computer Graphics, 2025

Nisha Huang, Weiming Dong, Yuxin Zhang, Fan Tang, Ronghui Li, Chongyang Ma, Xiu Li, Tong-Yee Lee, and Changsheng Xu. Creativesynth: Cross-art-attention for artis- tic image synthesis with multimodal diffusion.IEEE Trans- actions on Visualization and Computer Graphics, 2025. 2

work page 2025
[18]

Artcrafter: Text-image aligning style transfer via embedding reframing

Nisha Huang, Kaer Huang, Yifan Pu, Jiangshan Wang, Jie Guo, Yiqiang Yan, Xiu Li, and Tong-Yee Lee. Artcrafter: Text-image aligning style transfer via embedding reframing. arXiv preprint arXiv:2501.02064, 2025. 2

work page arXiv 2025
[19]

Unsplash.https://unsplash.com/

Unsplash Inc. Unsplash.https://unsplash.com/. 6

work page
[20]

Kingma and Max Welling

Diederik P. Kingma and Max Welling. Auto-encoding vari- ational bayes. InThe Second International Conference on Learning Representations. 4

work page
[21]

Flux.https://github.com/ black-forest-labs/flux,

Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, . 3, 6, 7

work page
[22]

Flux.1-depth-dev-lora.https:// huggingface

Black Forest Labs. Flux.1-depth-dev-lora.https:// huggingface . co / black - forest - labs / FLUX . 1-Depth-dev-lora, . 5

work page
[23]

Flux.1-dev-controlnet-depth.https: / / huggingface

Black Forest Labs. Flux.1-dev-controlnet-depth.https: / / huggingface . co / Shakker - Labs / FLUX . 1 - dev-ControlNet-Depth, . 4

work page
[24]

Ctrl-x: Controlling structure and appear- ance for text-to-image generation without guidance

Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, and Bolei Zhou. Ctrl-x: Controlling structure and appear- ance for text-to-image generation without guidance. InAd- vances in Neural Information Processing Systems, 2024. 2

work page 2024
[25]

Flow matching for genera- tive modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matthew Le. Flow matching for genera- tive modeling. InThe Eleventh International Conference on Learning Representations. 3

work page
[26]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Repre- sentations. 3

work page
[27]

Material palette: extraction of materials from a single image

Ivan Lopes, Fabio Pizzati, and Raoul de Charette. Material palette: extraction of materials from a single image. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4379–4388, 2024. 2, 3

work page 2024
[28]

Qwen2vl-flux: Unifying image and text guidance for controllable image generation, 2024

Pengqi Lu. Qwen2vl-flux: Unifying image and text guidance for controllable image generation, 2024. 2

work page 2024
[29]

T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. InProceedings of the AAAI conference on artificial intelligence, pages 4296–4304, 2024. 3, 5

work page 2024
[30]

Multi- modal attention for speech emotion recognition

Zexu Pan, Zhaojie Luo, Jichen Yang, and Haizhou Li. Multi- modal attention for speech emotion recognition. InProc. Interspeech 2020, pages 364–368, 2020. 3

work page 2020
[31]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

work page
[32]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InThe Twelfth Interna- tional Conference on Learning Representations. 7

work page
[33]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InInternational Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 7

work page 2021
[34]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Texture: Text-guided texturing of 3d shapes

Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. Texture: Text-guided texturing of 3d shapes. InACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023. 3

work page 2023
[36]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2, 4, 7

work page 2022
[37]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500– 22510, 2023. 2, 3

work page 2023
[38]

Alchemist: Parametric control of material proper- ties with diffusion models

Prafull Sharma, Varun Jampani, Yuanzhen Li, Xuhui Jia, Dmitry Lagun, Fredo Durand, Bill Freeman, and Mark Matthews. Alchemist: Parametric control of material proper- ties with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24130–24141, 2024. 2

work page 2024
[39]

Denois- ing diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. InInternational Conference on Learning Representations. 3

work page
[40]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

work page
[41]

Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Process- ing, 13(4):600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Process- ing, 13(4):600–612, 2004. 7

work page 2004
[42]

U-vap: User-specified visual appearance person- alization via decoupled self augmentation

You Wu, Kean Liu, Xiaoyue Mi, Fan Tang, Juan Cao, and Jintao Li. U-vap: User-specified visual appearance person- alization via decoupled self augmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9482–9491, 2024. 2, 3, 6, 7

work page 2024
[43]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models.arXiv preprint arxiv:2308.06721,

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Wysiwyg design of hypnotic line art

Chih-Kuo Yeh, Zhanping Liu, I-Hsuan Lin, Eugene Zhang, and Tong-Yee Lee. Wysiwyg design of hypnotic line art. IEEE Transactions on Visualization and Computer Graphics, 28(6):2517–2529, 2020. 1

work page 2020
[45]

Texture- dreamer: Image-guided texture synthesis through geometry- aware diffusion

Yu-Ying Yeh, Jia-Bin Huang, Changil Kim, Lei Xiao, Thu Nguyen-Phuoc, Numair Khan, Cheng Zhang, Manmohan Chandraker, Carl S Marshall, Zhao Dong, et al. Texture- dreamer: Image-guided texture synthesis through geometry- aware diffusion. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 4304–4314, 2024. 3

work page 2024
[46]

Controlnet-v1.1-depth.https : / / huggingface

Lvmin Zhang. Controlnet-v1.1-depth.https : / / huggingface . co / lllyasviel / control _ v11f1p_sd15_depth. 6

work page
[47]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023. 2, 3, 5, 7

work page 2023
[48]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 586–595, 2018. 7

work page 2018
[49]

Prospect: Prompt spectrum for attribute-aware personalization of diffusion models.ACM Transactions on Graphics (ToG), 42(6):1–14, 2023

Yuxin Zhang, Weiming Dong, Fan Tang, Nisha Huang, Haibin Huang, Chongyang Ma, Tong-Yee Lee, Oliver Deussen, and Changsheng Xu. Prospect: Prompt spectrum for attribute-aware personalization of diffusion models.ACM Transactions on Graphics (ToG), 42(6):1–14, 2023. 2, 3, 6, 7

work page 2023
[50]

Evf-sam: Early vision-language fusion for text- prompted segment anything model

Yuxuan Zhang, Tianheng Cheng, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xing- gang Wang. Evf-sam: Early vision-language fusion for text- prompted segment anything model. 2024. 8

work page 2024

[1] [1]

Text2tex: Text-driven tex- ture synthesis via diffusion models

Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. Text2tex: Text-driven tex- ture synthesis via diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18558–18568, 2023. 3

work page 2023

[2] [2]

Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis

Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis. InThe Twelfth International Conference on Learning Repre- sentations. 3

work page

[3] [3]

Zest: Zero-shot material trans- fer from a single image

Ta-Ying Cheng, Prafull Sharma, Andrew Markham, Niki Trigoni, and Varun Jampani. Zest: Zero-shot material trans- fer from a single image. InEuropean Conference on Com- puter Vision, pages 370–386. Springer, 2024. 2, 3, 4, 6, 7

work page 2024

[4] [4]

Catv- ton: Concatenation is all you need for virtual try-on with dif- fusion models

Zheng Chong, Xiao Dong, Haoxiang Li, Wenqing Zhang, Hanqing Zhao, Dongmei Jiang, Xiaodan Liang, et al. Catv- ton: Concatenation is all you need for virtual try-on with dif- fusion models. InThe Thirteenth International Conference on Learning Representations. 2

work page

[5] [5]

Single-image svbrdf cap- ture with a rendering-aware deep network.ACM Transac- tions on Graphics (ToG), 37(4):1–15, 2018

Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. Single-image svbrdf cap- ture with a rendering-aware deep network.ACM Transac- tions on Graphics (ToG), 37(4):1–15, 2018. 1

work page 2018

[6] [6]

controlnet-depth-sdxl-1.0.https : / / huggingface

Diffusers. controlnet-depth-sdxl-1.0.https : / / huggingface . co / diffusers / controlnet - depth-sdxl-1.0. 6

work page

[7] [7]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InIn- ternational Conference on Machine Learning, pages 12606– 12633. PMLR, 2024. 2, 3

work page 2024

[8] [8]

An image is worth one word: Personalizing text-to-image gener- ation using textual inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-or. An image is worth one word: Personalizing text-to-image gener- ation using textual inversion. InThe Eleventh International Conference on Learning Representations. 2

work page

[9] Kamil Garifullin, Maxim Nikolaev, Andrey Kuznetsov, and Aibek Alanov. Materialfusion: High-quality, zero-shot, and controllable material transfer with diffusion models. arXiv preprint arXiv:2502.06606, 2025.

[10] Philipp Henzler, Valentin Deschaintre, Niloy J Mitra, and Tobias Ritschel. Generative modelling of brdf textures from flash images. ACM Transactions on Graphics (ToG), 40(6):1–13, 2021.

[11] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control. In The Eleventh International Conference on Learning Representations.

[12] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations.

[13] Yiwei Hu, Julie Dorsey, and Holly Rushmeier. A novel framework for inverse procedural texture modeling. ACM Transactions on Graphics (ToG), 38(6):1–14, 2019.

[14] Yiwei Hu, Chengan He, Valentin Deschaintre, Julie Dorsey, and Holly Rushmeier. An inverse procedural modeling pipeline for svbrdf maps. ACM Transactions on Graphics (ToG), 41(2):1–17, 2022.

[15] Nisha Huang, Fan Tang, Weiming Dong, and Changsheng Xu. Draw your art dream: Diverse digital art synthesis with multimodal guided diffusion. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1085–1094, 2022.

[16] Nisha Huang, Yuxin Zhang, Fan Tang, Chongyang Ma, Haibin Huang, Weiming Dong, and Changsheng Xu. Diffstyler: Controllable dual diffusion for text-driven image stylization. IEEE Transactions on Neural Networks and Learning Systems, 2024.

[17] Nisha Huang, Weiming Dong, Yuxin Zhang, Fan Tang, Ronghui Li, Chongyang Ma, Xiu Li, Tong-Yee Lee, and Changsheng Xu. Creativesynth: Cross-art-attention for artistic image synthesis with multimodal diffusion. IEEE Transactions on Visualization and Computer Graphics, 2025.

[18] Nisha Huang, Kaer Huang, Yifan Pu, Jiangshan Wang, Jie Guo, Yiqiang Yan, Xiu Li, and Tong-Yee Lee. Artcrafter: Text-image aligning style transfer via embedding reframing. arXiv preprint arXiv:2501.02064, 2025.

[19] Unsplash Inc. Unsplash. https://unsplash.com/.

[20] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In The Second International Conference on Learning Representations.

[21] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux.

[22] Black Forest Labs. Flux.1-depth-dev-lora. https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev-lora.

[23] Black Forest Labs. Flux.1-dev-controlnet-depth. https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Depth.

[24] Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, and Bolei Zhou. Ctrl-x: Controlling structure and appearance for text-to-image generation without guidance. In Advances in Neural Information Processing Systems, 2024.

[25] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations.

[26] Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations.

[27] Ivan Lopes, Fabio Pizzati, and Raoul de Charette. Material palette: extraction of materials from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4379–4388, 2024.

[28] Pengqi Lu. Qwen2vl-flux: Unifying image and text guidance for controllable image generation, 2024.

[29] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4296–4304, 2024.

[30] Zexu Pan, Zhaojie Luo, Jichen Yang, and Haizhou Li. Multi-modal attention for speech emotion recognition. In Proc. Interspeech 2020, pages 364–368, 2020.

[31] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205,

[32] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations.

[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[34] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:..., 2024.

[35] Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. Texture: Text-guided texturing of 3d shapes. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.

[36] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[37] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.

[38] Prafull Sharma, Varun Jampani, Yuanzhen Li, Xuhui Jia, Dmitry Lagun, Fredo Durand, Bill Freeman, and Mark Matthews. Alchemist: Parametric control of material properties with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24130–24141, 2024.

[39] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations.

[40] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

[41] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[42] You Wu, Kean Liu, Xiaoyue Mi, Fan Tang, Juan Cao, and Jintao Li. U-vap: User-specified visual appearance personalization via decoupled self augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9482–9491, 2024.

[43] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721,

[44] Chih-Kuo Yeh, Zhanping Liu, I-Hsuan Lin, Eugene Zhang, and Tong-Yee Lee. Wysiwyg design of hypnotic line art. IEEE Transactions on Visualization and Computer Graphics, 28(6):2517–2529, 2020.

[45] Yu-Ying Yeh, Jia-Bin Huang, Changil Kim, Lei Xiao, Thu Nguyen-Phuoc, Numair Khan, Cheng Zhang, Manmohan Chandraker, Carl S Marshall, Zhao Dong, et al. Texturedreamer: Image-guided texture synthesis through geometry-aware diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4304–4314, 2024.

[46] Lvmin Zhang. Controlnet-v1.1-depth. https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth.

[47] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

[48] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

[49] Yuxin Zhang, Weiming Dong, Fan Tang, Nisha Huang, Haibin Huang, Chongyang Ma, Tong-Yee Lee, Oliver Deussen, and Changsheng Xu. Prospect: Prompt spectrum for attribute-aware personalization of diffusion models. ACM Transactions on Graphics (ToG), 42(6):1–14, 2023.

[50] Yuxuan Zhang, Tianheng Cheng, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model. 2024.