
arxiv: 2605.15660 · v1 · pith:3I2BXB52new · submitted 2026-05-15 · 💻 cs.CV

MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

Pith reviewed 2026-05-20 18:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords material transfer · diffusion transformer · zero-shot · image integration · multi-modal attention · training-free

The pith

MaTe enables high-quality material transfer using only images in a diffusion transformer without text or additional networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion-based material transfer methods typically depend on text prompts, fine-tuning, or extra networks such as ControlNet, which introduce feature misalignment and extra computational costs. MaTe proposes a streamlined approach that integrates the input images directly at the token level. This allows the diffusion transformer to process them together through multi-modal attention within a shared latent space. The result is a zero-shot, training-free system that generates materials while keeping precise detail alignment. This significantly reduces the prerequisites for inference compared to prior techniques.

Core claim

By integrating input images at the token level and processing them via multi-modal attention in a shared latent space, MaTe removes the need for textual guidance, reference networks, adapters, ControlNet, inversion sampling, or model fine-tuning, enabling high-quality material generation in a zero-shot, training-free paradigm that outperforms state-of-the-art methods in visual quality and efficiency.

What carries the argument

Token-level image integration via multi-modal attention in a shared latent space of the diffusion transformer.
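The mechanism can be pictured with a minimal sketch in plain Python, under stated assumptions: each image is already encoded as a list of token vectors in a shared latent space, the groups are concatenated into one sequence, and a single attention pass lets every token attend to every other token. The group names (`material`, `structure`, `noisy`), the identity query/key/value projections, and the random toy tokens are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(token_groups):
    """Single-head self-attention over the concatenation of all groups.

    Assumption: identity Q/K/V projections, so only the token-level
    concatenation and the shared attention pass are shown.
    """
    tokens = [tok for group in token_groups for tok in group]
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


def make_tokens(count, dim, rng):
    """Toy stand-in for an image encoder: random token vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
material = make_tokens(4, 8, rng)   # hypothetical material-image tokens
structure = make_tokens(4, 8, rng)  # hypothetical structure-image tokens
noisy = make_tokens(4, 8, rng)      # hypothetical noisy-latent tokens

outputs, weights = joint_attention([material, structure, noisy])

# Attention mass that the first noisy-latent token (index 8) places on
# the four material tokens (indices 0..3).
mass_on_material = sum(weights[8][:4])
```

Because all groups share one sequence, no separate reference network or adapter is needed to route information between them in this sketch; whether that is enough to avoid misalignment in practice is exactly what the referee report below questions.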

If this is right

  • Material transfer can be performed without any textual input or prompt engineering.
  • No additional training or fine-tuning of the model is required for new material transfers.
  • Feature misalignment issues common in prior methods are avoided through unified processing.
  • The computational cost is lower due to the absence of reference networks and adapters.
  • Precise detail alignment between source and target is preserved automatically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This token-level approach could extend to other image manipulation tasks where alignment is critical.
  • Simplifying the pipeline might allow integration into real-time graphics or design tools.
  • The success without text suggests that visual information alone suffices for certain generation tasks in diffusion models.

Load-bearing premise

Integrating input images at the token level and processing them via multi-modal attention in a shared latent space is sufficient to eliminate feature misalignment and remove the need for textual guidance or extra networks.

What would settle it

The claim would be disproved by cases in which material transfer that relies solely on token-level image integration, with no textual or network assistance, produces misaligned features or poor quality.

Figures

Figures reproduced from arXiv: 2605.15660 by Chubin Chen, Henglin Liu, Jie Guo, Kaer Huang, Nisha Huang, Tong-Yee Lee, Xiu Li, Yizhou Lin.

Figure 1. MaTe is a material transfer method that enables the transformation of textures from a single real-world image without any prior knowledge. This approach is not only capable of successfully extracting texture information from antiques with thousands of years of history but also handles popular computer graphics images, jewelry, and fur materials, providing strong support for design work.
Figure 2. Simplified structure comparison of different kinds of …
Figure 3. Our method achieves high-quality material transfer by simply passing three types of image tokens (material image tokens …
Figure 4. Qualitative comparison on the MTB dataset. MaTe demonstrates a distinct advantage in handling complex materials. (d)-(g) are …
Figure 5. Material image effect intensity ablation experiment.
Figure 6. Ablation experiment on the depth control parameter …
Figure 7. Regarding the ablation experiment results when …
Original abstract

Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MaTe, a streamlined diffusion transformer framework for material transfer. It integrates source and target images directly at the token level for unified processing via multi-modal attention within a shared latent space, eliminating textual guidance, reference networks, adapters, ControlNet, inversion sampling, and model fine-tuning. The central claim is that this yields high-quality, zero-shot, training-free material generation that outperforms prior state-of-the-art methods in visual quality and efficiency while preserving precise detail alignment.

Significance. If the architectural claims hold with rigorous validation, the work would represent a meaningful simplification of diffusion-based material transfer pipelines, reducing reliance on auxiliary conditioning mechanisms and lowering inference costs in computer vision applications such as image editing and synthesis.

major comments (2)
  1. [Abstract] The claim of 'extensive experiments' demonstrating outperformance in visual quality, efficiency, and detail alignment supplies no quantitative metrics, baselines, ablation details, dataset descriptions, or evaluation protocols. This absence is load-bearing for the central claim, as the support for superiority under a zero-shot paradigm cannot be assessed from the provided information.
  2. [Method] Token-level integration and multi-modal attention description: the manuscript does not supply a concrete analysis, derivation, or empirical test showing how shared-space multi-modal attention alone enforces cross-image correspondences for material properties (reflectance, texture, lighting) without semantic drift or detail loss. Prior diffusion material methods required explicit mechanisms precisely to address this; the assumption that attention suffices remains unverified and is central to the zero-shot training-free claim.
minor comments (1)
  1. [Abstract] The phrase 'precise detail alignment' is used without operational definition or reference to specific metrics (e.g., edge preservation, texture fidelity) that would allow readers to interpret the claim.
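To make the request for an operational definition concrete, here is one hypothetical way an edge-preservation score could be defined: threshold a finite-difference gradient on both images and take the intersection-over-union of the two edge maps. The function names, the threshold, and the toy grayscale grids are assumptions for illustration; the paper does not state that it uses this metric.

```python
import math


def edge_map(image, threshold):
    """Boolean edge map from forward-difference gradient magnitude."""
    height, width = len(image), len(image[0])
    edges = [[False] * width for _ in range(height)]
    for y in range(height - 1):
        for x in range(width - 1):
            gx = image[y][x + 1] - image[y][x]
            gy = image[y + 1][x] - image[y][x]
            edges[y][x] = math.hypot(gx, gy) > threshold
    return edges


def edge_iou(image_a, image_b, threshold=0.5):
    """Intersection-over-union of the two edge maps (1.0 = same edges)."""
    edges_a = edge_map(image_a, threshold)
    edges_b = edge_map(image_b, threshold)
    pairs = [
        (a, b)
        for row_a, row_b in zip(edges_a, edges_b)
        for a, b in zip(row_a, row_b)
    ]
    intersection = sum(1 for a, b in pairs if a and b)
    union = sum(1 for a, b in pairs if a or b)
    return intersection / union if union else 1.0


# Toy 4x4 grayscale grids: a vertical step edge between columns 1 and 2.
source = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
# Same geometry, different intensities (a stand-in for a new material).
restyled = [[0.2, 0.2, 0.9, 0.9] for _ in range(4)]
# Edge moved one column to the left (a stand-in for misalignment).
shifted = [[0.0, 1.0, 1.0, 1.0] for _ in range(4)]

aligned_score = edge_iou(source, restyled)
misaligned_score = edge_iou(source, shifted)
```

A score of this kind only captures geometric edges; texture fidelity would need a separate measure.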

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions planned for the manuscript to strengthen the presentation of our claims.

Point-by-point responses
  1. Referee: [Abstract] The claim of 'extensive experiments' demonstrating outperformance in visual quality, efficiency, and detail alignment supplies no quantitative metrics, baselines, ablation details, dataset descriptions, or evaluation protocols. This absence is load-bearing for the central claim, as the support for superiority under a zero-shot paradigm cannot be assessed from the provided information.

    Authors: We agree that the abstract, being a high-level summary, would be strengthened by explicit references to the quantitative support. The experiments section of the manuscript reports comparisons against prior methods using standard metrics for visual quality (e.g., LPIPS, FID), efficiency (inference time and memory), and detail alignment (e.g., structural similarity measures), along with dataset details and protocols. We will revise the abstract to concisely reference these elements and key numerical improvements, ensuring the central claims are better supported without altering the zero-shot emphasis.
    Revision: yes

  2. Referee: [Method] Token-level integration and multi-modal attention description: the manuscript does not supply a concrete analysis, derivation, or empirical test showing how shared-space multi-modal attention alone enforces cross-image correspondences for material properties (reflectance, texture, lighting) without semantic drift or detail loss. Prior diffusion material methods required explicit mechanisms precisely to address this; the assumption that attention suffices remains unverified and is central to the zero-shot training-free claim.

    Authors: We acknowledge the value of a more explicit verification for this core design choice. In the revised manuscript we will expand the method section with a short analysis of how token-level integration in the shared latent space enables the multi-modal attention layers to align material attributes across source and target images. We will also add empirical support in the form of attention-map visualizations and targeted ablations that isolate the contribution of the shared-space attention to correspondence preservation. These additions will directly address the concern relative to prior explicit conditioning approaches.
    Revision: yes
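The attention-map analysis promised in the second response could start from a summary like the following sketch, which collapses a row-stochastic attention matrix into a small table of how much attention each token group places on each other group. The function name `group_attention_mass`, the group layout, and the hand-written toy matrix are assumptions for illustration, not part of the paper or the rebuttal.

```python
def group_attention_mass(weights, group_sizes):
    """Average attention mass from each query group to each key group.

    weights: square list of lists whose rows each sum to 1.
    group_sizes: sizes of the consecutive token groups in the sequence.
    Returns mass[i][j] = mean over queries in group i of the total
    weight placed on keys in group j.
    """
    bounds = []
    start = 0
    for size in group_sizes:
        bounds.append((start, start + size))
        start += size
    if start != len(weights):
        raise ValueError("group sizes must cover the whole sequence")

    mass = []
    for q_lo, q_hi in bounds:
        row = []
        for k_lo, k_hi in bounds:
            total = sum(sum(weights[q][k_lo:k_hi]) for q in range(q_lo, q_hi))
            row.append(total / (q_hi - q_lo))
        mass.append(row)
    return mass


# Toy attention matrix over 4 tokens: group 0 = tokens 0-1 (say, material),
# group 1 = tokens 2-3 (say, the latent being denoised).
toy_weights = [
    [0.50, 0.50, 0.00, 0.00],
    [0.25, 0.25, 0.25, 0.25],
    [0.00, 0.00, 1.00, 0.00],
    [0.10, 0.10, 0.40, 0.40],
]
mass = group_attention_mass(toy_weights, [2, 2])
# mass[1][0] is the share of attention the second group gives to the first.
```

Tracking the cross-group entries across layers and denoising steps would give a first quantitative handle on whether shared-space attention actually carries material information between images.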

Circularity Check

0 steps flagged

Architectural design choice presented without reduction to inputs or self-citations


The paper proposes MaTe as a streamlined diffusion framework that integrates source and target images directly at the token level for unified multi-modal attention processing in a shared latent space. This is explicitly positioned as an architectural choice that removes textual guidance, reference networks, adapters, ControlNet, inversion sampling, and fine-tuning to achieve zero-shot training-free material transfer. No equations, derivations, or load-bearing steps in the abstract or described method reduce the central claims to fitted parameters renamed as predictions, self-definitional loops, or chains of self-citations whose validity depends on the present work. The contribution is supported by experimental comparisons rather than internal redefinitions, rendering the framework self-contained as an empirical simplification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities; the contribution is framed as a new architectural integration within existing diffusion transformer components.


