arxiv: 2605.15681 · v1 · pith:LE7WGICDnew · submitted 2026-05-15 · 💻 cs.GR · cs.CV

DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer

Pith reviewed 2026-05-19 19:08 UTC · model grok-4.3

classification 💻 cs.GR cs.CV
keywords material transfer · diffusion transformer · 3D shader LoRA · computer graphics · neural rendering · image synthesis · attention optimization

The pith

DealMaTe transfers materials across objects using depth, normal, and lighting images in a text-free diffusion framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DealMaTe to address challenges like text dependency and extra computational costs in diffusion-based material transfer. It builds a simplified diffusion transformer that uses depth, normal, and lighting images directly as conditions. A lightweight Multi-Dim 3D Shader LoRA injects these 3D cues without altering the base model. Shader Causal Mutual Attention plus KV caching improves speed and efficiency. Experiments across varied objects and lighting show consistent high-fidelity results for arbitrary input materials.

Core claim

DealMaTe is a simplified diffusion framework for material transfer that relies solely on depth, normal, and lighting images. It introduces Multi-Dim 3D Shader LoRA to add 3D control conditions compatibly without changing base model weights and applies Shader Causal Mutual Attention with key-value caching to reduce latency from multiple inputs while preserving output quality.
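
The caching idea in this claim can be illustrated with a toy sketch. It assumes, as one plausible reading that this page does not confirm, that the condition tokens (depth, normal, lighting) never attend to the noisy image tokens, so their keys and values are the same at every denoising step and only need to be projected once. All names here (`ConditionKVCache`, `denoise_step_attention`, the identity projection weights) are hypothetical, and this is plain Python rather than the authors' implementation.

```python
import math

def project(tokens, weight):
    """Apply a linear projection (weight given as a list of rows) to each token."""
    return [[sum(w * t for w, t in zip(row, tok)) for row in weight] for tok in tokens]

def attend(queries, keys, values):
    """Scaled dot-product attention of every query over all keys and values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        z = sum(exps)
        out.append([sum(e / z * v[j] for e, v in zip(exps, values))
                    for j in range(len(values[0]))])
    return out

class ConditionKVCache:
    """Projects condition tokens to keys/values once and reuses the result."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.cached = None
        self.cond_projections = 0  # how many times the condition K/V were computed

    def cond_kv(self, cond_tokens):
        if self.cached is None:
            self.cached = (project(cond_tokens, self.w_k),
                           project(cond_tokens, self.w_v))
            self.cond_projections += 1
        return self.cached

def denoise_step_attention(image_tokens, cond_tokens, w_q, cache):
    """Image tokens attend to themselves plus the cached condition keys/values."""
    cond_k, cond_v = cache.cond_kv(cond_tokens)
    img_q = project(image_tokens, w_q)
    img_k = project(image_tokens, cache.w_k)
    img_v = project(image_tokens, cache.w_v)
    return attend(img_q, img_k + cond_k, img_v + cond_v)

# Toy usage: identity projections, three stand-in condition tokens, four steps.
w = [[1.0, 0.0], [0.0, 1.0]]
cache = ConditionKVCache(w_k=w, w_v=w)
cond = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-ins for depth, normal, lighting
for step in range(4):
    image = [[0.1 * step, 1.0]]  # the noisy image token changes at every step
    out = denoise_step_attention(image, cond, w, cache)
```

After the loop, `cache.cond_projections` is 1 even though four steps ran, which is the saving the sketch is meant to show: only the image tokens are re-projected per step. The result of each step is the same as recomputing every key and value from scratch, provided the assumption about one-directional attention holds.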

What carries the argument

Multi-Dim 3D Shader LoRA, a lightweight adapter that injects depth, normal, and lighting information into the diffusion transformer for compatible control.
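
For readers unfamiliar with LoRA, the following minimal sketch shows the general mechanism the description relies on: a frozen base weight `W` plus a trainable low-rank update `B @ A`, here with one adapter per condition. It is a generic illustration of low-rank adaptation, not the paper's Multi-Dim 3D Shader LoRA; the class name `MultiLoRALinear`, the rank, and the toy weights are all hypothetical.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class MultiLoRALinear:
    """One frozen base weight W with a separate low-rank adapter per condition."""

    def __init__(self, W, conditions, rank=1, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = [row[:] for row in W]  # frozen: never written to after this
        self.scale = alpha / rank
        self.adapters = {}
        for name in conditions:
            # A starts small and random, B starts at zero, so each adapter
            # contributes nothing until it has been trained.
            A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
            B = [[0.0] * rank for _ in range(d_out)]
            self.adapters[name] = {"A": A, "B": B}

    def forward(self, x, condition=None):
        base = matvec(self.W, x)
        if condition is None:
            return base
        ad = self.adapters[condition]
        delta = matvec(ad["B"], matvec(ad["A"], x))  # low-rank update B(Ax)
        return [b + self.scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
x = [1.0, 2.0, 3.0]
layer = MultiLoRALinear(W, ["depth", "normal", "lighting"])

untrained = layer.forward(x, "depth")  # equals the base output [7.0, -1.5]

# Stand-in for training: overwrite only the depth adapter's factors.
layer.adapters["depth"]["A"] = [[1.0, 0.0, 0.0]]
layer.adapters["depth"]["B"] = [[1.0], [0.0]]
trained = layer.forward(x, "depth")    # [8.0, -1.5]
```

Changing the depth adapter leaves `layer.W` and the other two adapters untouched, which is the sense in which such an adapter adds control without modifying base model weights.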

Load-bearing premise

The lightweight 3D information injection via Multi-Dim 3D Shader LoRA enables compatible control conditions and achieves harmonious and stable results without modifying the base model weights.

What would settle it

A reproducible case of feature misalignment or unstable output when transferring materials under complex geometry or extreme lighting would undercut the claim of reliable high-fidelity performance.

Figures

Figures reproduced from arXiv: 2605.15681 by Jie Guo, Nisha Huang, Tong-Yee Lee, Xiu Li, Yizhou Lin, Zitong Yu.

Figure 1: DealMaTe is a material transfer method that can transform materials from a single real-world image without any prior knowledge. This method is not only capable of successfully extracting texture information from antiques with thousands of years of history but also handles novel virtual materials generated by computer graphics images. It excels across diverse scenarios such as product design, antique restor…
Figure 2: Simplified structure comparison of different kinds of material transfer methods. Our approach neither relies on fine-tuning image sets/individual…
Figure 3: Comparison with our conference version work MaTe. DealMaTe is…
Figure 4: Our method achieves high-quality material transfer by feeding depth, normal, and lighting inputs into respective shader LoRAs and then passing the…
Figure 5: To optimize the inference performance of the material transfer task,…
Figure 6: Qualitative comparisons with other methods. Column (d) shows the result of our conference version work MaTe. Columns (e)-(g) show a qualitative…
Figure 8: Visual comparison of the SCMA ablation. With SCMA, the gener…
Figure 9: We evaluate the necessity of each control signal by removing exactly one condition (depth, lighting, or normal) from our 3D Shader Conditional Branch.
Figure 10: Ablation experiment on the varying depth control parameter…
Figure 11: Ablation analysis of the varying normal control parameter…
Figure 7: For metallic and plastic samples spurious patterns emerge…
Figure 12: Ablation experiment on the varying lighting control parameter…
Figure 14: Limitations. When encountering the extreme cases shown in the…
Figure 15: Failure cases due to inaccurate geometric conditions estimation.
Figure 13: Various downstream applications of our material transfer method.
Figure 16: Discussions. The results when the material diagram contains multi…
Original abstract

Recently, diffusion-based material transfer methods rely on image fine-tuning or complex architectures with auxiliary networks but face challenges such as text dependency, additional computational costs, and feature misalignment. To address these limitations, we propose DealMaTe, using DEpth, normAl, and Lighting images for MAterial TransfEr. DealMaTe is a simplified diffusion framework that eliminates text guidance and reference networks. We design a lightweight 3D information injection method, Multi-Dim 3D Shader LoRA, which, without modifying the base model weights, enables compatible control conditions and achieves harmonious and stable results. Additionally, we optimize the attention mechanism with Shader Causal Mutual Attention and key-value (KV) caching to reduce inference latency caused by multiple conditions, improve computational efficiency, and achieve high-quality material transfer results with low architectural complexity. Extensive experiments covering a wide variety of objects and lighting conditions consistently demonstrate that DealMaTe achieves remarkable high-fidelity material transfer under arbitrary input materials. The code is available at https://github.com/haha-lisa/DealMaTe.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DealMaTe, a diffusion transformer framework for multi-dimensional material transfer that takes depth, normal, and lighting images as conditions. It eliminates text guidance and reference networks, introducing a lightweight Multi-Dim 3D Shader LoRA for 3D information injection without altering base model weights, plus Shader Causal Mutual Attention and KV caching to improve efficiency and reduce latency from multiple conditions. The authors claim that extensive experiments across varied objects and lighting conditions demonstrate remarkable high-fidelity transfer under arbitrary input materials, with code released at the provided GitHub link.

Significance. If the quantitative validation holds, the work offers a simplified, lower-complexity alternative to existing diffusion-based material transfer methods in computer graphics, potentially reducing text dependency, auxiliary network overhead, and feature misalignment. The emphasis on compatible control conditions via LoRA and efficiency optimizations, combined with code availability, supports reproducibility and practical adoption.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'extensive experiments... consistently demonstrate that DealMaTe achieves remarkable high-fidelity material transfer' provides no quantitative metrics, baselines, ablation results, or failure cases. This is load-bearing for the contribution, as the support for the method's effectiveness cannot be assessed from the stated claims alone.
  2. [Method] Method (Multi-Dim 3D Shader LoRA description): the assertion that the lightweight injection 'enables compatible control conditions and achieves harmonious and stable results without modifying the base model weights' lacks any derivation, analysis of feature alignment across material distributions, or ablation isolating the LoRA contribution versus the Shader Causal Mutual Attention. This is the least-secured step for the high-fidelity claim under arbitrary inputs.
minor comments (2)
  1. [Abstract] The acronym expansion for DealMaTe is given but could be stated more explicitly on first use in the title or abstract for clarity.
  2. [Method] Notation for the attention mechanism (e.g., 'Shader Causal Mutual Attention') would benefit from a brief equation or diagram reference to distinguish it from standard causal attention.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the presentation of our contributions.

  1. Referee: [Abstract] Abstract: the central claim that 'extensive experiments... consistently demonstrate that DealMaTe achieves remarkable high-fidelity material transfer' provides no quantitative metrics, baselines, ablation results, or failure cases. This is load-bearing for the contribution, as the support for the method's effectiveness cannot be assessed from the stated claims alone.

    Authors: We agree that the abstract, in its original form, presented the experimental outcomes at a high level without specific quantitative support. The full manuscript contains extensive quantitative evaluations, baseline comparisons, and ablation studies in Section 4, but these were not reflected in the abstract. We have revised the abstract to include key metrics (such as PSNR, SSIM, and LPIPS scores against baselines) and a brief reference to the ablation results, while noting that failure cases are analyzed in the supplementary material. This change directly addresses the concern about substantiating the high-fidelity claims. revision: yes

  2. Referee: [Method] Method (Multi-Dim 3D Shader LoRA description): the assertion that the lightweight injection 'enables compatible control conditions and achieves harmonious and stable results without modifying the base model weights' lacks any derivation, analysis of feature alignment across material distributions, or ablation isolating the LoRA contribution versus the Shader Causal Mutual Attention. This is the least-secured step for the high-fidelity claim under arbitrary inputs.

    Authors: We appreciate this observation regarding the need for more rigorous justification. The original manuscript describes the Multi-Dim 3D Shader LoRA design and its practical benefits in Section 3, including how it injects conditions without altering base weights. However, we acknowledge the value of explicit derivation and isolation. We have added a derivation of the feature alignment mechanism across material distributions in the revised Section 3.2 and included a dedicated ablation study in Section 4.3 that compares variants with and without the LoRA (versus the attention module alone). These additions demonstrate the contribution to harmonious and stable results under arbitrary inputs. revision: yes
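
The first response above names PSNR among the metrics added to the abstract. As a reminder of what that number measures, the standard definition is PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared error between two images and MAX is the largest possible pixel value. The sketch below applies it to flat lists of pixel values in [0, 1]; the function name and the list-based image representation are illustrative choices, not taken from the paper or its code.

```python
import math

def psnr(reference, result, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(result) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, result)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no noise to measure
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] scale gives an MSE of about 0.01,
# which corresponds to roughly 20 dB.
score = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Higher values mean the result is closer to the reference. PSNR is a purely pixel-wise measure, which is why it is usually reported together with structural and perceptual metrics such as the SSIM and LPIPS scores the response also mentions.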

Circularity Check

0 steps flagged

No derivation chain; empirical method is self-contained

full rationale

The paper describes an engineering contribution: a diffusion-based material transfer pipeline that injects depth/normal/lighting via Multi-Dim 3D Shader LoRA and optimizes attention with Shader Causal Mutual Attention. No equations, first-principles derivations, or parameter-fitting steps are presented that could reduce to their own inputs by construction. Claims of compatibility and high-fidelity results are supported by external experiments on varied objects and lighting, not by any self-referential definition or fitted-input prediction. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. The work is therefore scored as having no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method relies on standard diffusion models and LoRA adapters whose properties are assumed from prior work.

pith-pipeline@v0.9.0 · 5765 in / 1041 out tokens · 40066 ms · 2026-05-19T19:08:34.414842+00:00 · methodology
