
arxiv: 2508.20640 · v2 · submitted 2025-08-28 · 💻 cs.CV

CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models

Pith reviewed 2026-05-18 20:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords graffiti generation · facial identity preservation · diffusion models · style transfer · self-attention mechanism · LoRA fine-tuning · CLIP-guided reposing

The pith

A diffusion-based system creates graffiti art from photos while keeping the subject's face recognizable by styling first then locking in identity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CraftGraffiti as an end-to-end framework that turns input images into text-guided graffiti using a LoRA-fine-tuned diffusion transformer. It tackles the problem of facial distortions in high-contrast abstract styles by first applying the style, then adding a face-consistent self-attention layer that injects explicit identity embeddings. The work validates that this style-first order reduces attribute drift compared with reversing the steps, while also supporting keypoint-free pose changes through CLIP prompt extension. Results show competitive facial consistency alongside leading scores for aesthetics and human preference, with a festival deployment demonstrating practical use in creative settings.
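For readers unfamiliar with LoRA, the low-rank update it relies on can be sketched as follows. This is a generic illustration of the technique described in the LoRA paper [19], not the authors' training code; the matrix sizes, the rank, and the `alpha` value are assumptions chosen only for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the merged weight used at inference."""
    r = len(a)                      # rank = number of rows of A
    delta = matmul(b, a)            # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy sizes (assumed): a 4x6 frozen weight adapted with rank 2.
d_out, d_in, rank = 4, 6, 2
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable, small random init
b = [[0.0] * rank for _ in range(d_out)]                               # trainable, zero init

# With B initialised to zero the adapted layer starts out identical to the base layer.
merged = lora_weight(w, a, b, alpha=8.0)
```

Only `a` and `b` are trained, so the number of trainable values is `rank * (d_in + d_out)` rather than `d_in * d_out`; the gap grows quickly at realistic layer sizes.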

Core claim

CraftGraffiti shows that graffiti style transfer followed by identity enforcement via augmented self-attention layers produces outputs with reduced facial attribute drift, competitive identity preservation metrics, and high aesthetic and preference scores, outperforming the identity-first ordering.

What carries the argument

The face-consistent self-attention mechanism that augments attention layers with explicit identity embeddings to protect facial features after style application.
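One way to picture this mechanism is the "extra token" reading given in the paper's Figure 3 caption: an identity vector is appended to the keys and values so that every image token can attend to it. The sketch below is a minimal single-head version under that assumption. It omits learned projections, and the function names, toy tokens, and identity vector are hypothetical rather than taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out, weights

def identity_augmented_attention(tokens, identity_vec):
    """Self-attention in which every token can also attend to one identity token."""
    keys = tokens + [identity_vec]      # identity embedding appended as an extra key
    values = tokens + [identity_vec]    # and as an extra value
    return attention(tokens, keys, values)

tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # toy image tokens
identity = [0.0, 0.0, 2.0]                     # hypothetical identity embedding
out, weights = identity_augmented_attention(tokens, identity)
```

In this toy setup the third output channel is nonzero only because of the identity token, which makes its contribution easy to inspect.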

If this is right

  • Ordering style transfer before identity enforcement measurably lowers attribute drift relative to the reverse sequence.
  • The resulting images achieve competitive facial feature consistency with existing methods.
  • Aesthetic quality and human preference reach state-of-the-art levels for text-guided graffiti generation.
  • Pose customization works without keypoints while facial coherence is retained.
  • The pipeline supports live creative deployments such as festival installations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ordering principle could be tested on other extreme stylizations such as pixel art or mosaic rendering to see whether identity drift drops similarly.
  • Integration with existing portrait editing tools might extend the approach to video or animation sequences.
  • Cultural applications could include generating identity-preserving art for communities that value both stylistic freedom and recognizability.

Load-bearing premise

The explicit identity embeddings added to attention layers will reliably block subtle distortions to eyes, nose, or mouth even when graffiti stylization is extreme.

What would settle it

Generate graffiti images from a fixed set of input faces with the system, then run two checks: whether independent viewers match the outputs to the originals at rates no better than chance, and whether measured eye, nose, and mouth drift exceeds that of a simple reverse-order baseline. Either result would undercut the claim.
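The "no better than chance" comparison could be made concrete with a one-sided binomial test. The sketch below assumes a forced-choice matching task; the number of candidates, trial count, and success count are invented for illustration and are not results from the paper.

```python
from math import comb

def binomial_p_value(successes, trials, chance):
    """One-sided P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(comb(trials, k) * chance**k * (1 - chance)**(trials - k)
               for k in range(successes, trials + 1))

# Hypothetical study: viewers pick the source face among 5 candidates,
# so the chance rate is 1/5. The counts below are made up.
p = binomial_p_value(successes=38, trials=100, chance=0.2)
```

A small `p` would indicate matching above chance; a value near 1 would be consistent with identity having been lost.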

Figures

Figures reproduced from arXiv: 2508.20640 by Ayan Banerjee, Fernando Vilariño, Josep Lladós.

Figure 1. [no caption extracted]
Figure 2. CraftGraffiti transforms a source image into a graffiti-style portrait while preserving the subject’s identity and pose. Graffiti style is injected via a pretrained diffusion model fine-tuned with LoRA for the dedicated style. Later on, another diffusion model is equipped with face-consistent self-attention and cross-attention modules to preserve key facial features, and a LoRA module enables pose customization …
Figure 3. Face-consistent self-attention: we can easily preserve the facial attributes of the character through the extra dimension of the identity embedding. To ensure that the two generated characters depict the same identity, we introduce an explicit identity embedding into the attention computation. Conceptually, this means adding a special identity vector (often derived from a reference face) as an extra token or …
Figure 4. Qualitative comparison: CraftGraffiti perfectly transforms the input image into the graffiti style and maintains facial attributes, while the rest cannot do both. We also compare CraftGraffiti with InstructPix2Pix [6]; however, it neither adds objects nor generates high-quality images. Similarly, VLMs (Grok 3 [16] and GPT4o [22]) maintain consistency, unable to blend style due to the complexity of graffiti…
Figure 5. It has been observed that with the FLUX.1 dev (12B) baseline, we neither achieve consistency …
Figure 6. Ablation of the self-attention: the face-consistent self-attention primarily focuses on the human faces and their corresponding poses, whereas the traditional self-attention and the subject-driven self-attention of Consistory [45] diverge towards the global scenario. Cultural and Societal Implications: Beyond technical performance, CraftGraffiti acts as an enabler in ongoing cultural debates around represen…
Figure 7. Human evaluation: CraftGraffiti outperforms SOTA techniques [55, 27, 6, 16, 22] in style blending and aesthetics while preserving facial attributes (a decent performance in recognizability). Limitations: While our model successfully preserves facial attributes, real-time generation in public settings is computationally demanding, and user interactions are influenced by the physical constraints of the inst…
Figure 8. CraftGraffiti artistic installation: conceptual rendering and actual setup at Cruïlla Festival 2025. (1) The general impression of the users was that the demo was fun and engaging, often reacting with surprise and amusement at the results. The majority of participants understood that the demonstration was intended as a playful, exploratory experience rather than a precise or professional tool. (2) Some user…
Figure 9. Example outcomes from the demonstration at Cruïlla Festival Barcelona 2025.
Figure 10. Some more qualitative examples generated with …
Original abstract

Preserving facial identity under extreme stylistic transformation remains a major challenge in generative art. In graffiti, a high-contrast, abstract medium, subtle distortions to the eyes, nose, or mouth can erase the subject's recognizability, undermining both personal and cultural authenticity. We present CraftGraffiti, an end-to-end text-guided graffiti generation framework designed with facial feature preservation as a primary objective. Given an input image and a style and pose descriptive prompt, CraftGraffiti first applies graffiti style transfer via LoRA-fine-tuned pretrained diffusion transformer, then enforces identity fidelity through a face-consistent self-attention mechanism that augments attention layers with explicit identity embeddings. Pose customization is achieved without keypoints, using CLIP-guided prompt extension to enable dynamic re-posing while retaining facial coherence. We formally justify and empirically validate the "style-first, identity-after" paradigm, showing it reduces attribute drift compared to the reverse order. Quantitative results demonstrate competitive facial feature consistency and state-of-the-art aesthetic and human preference scores, while qualitative analyses and a live deployment at the Cruilla Festival highlight the system's real-world creative impact. CraftGraffiti advances the goal of identity-respectful AI-assisted artistry, offering a principled approach for blending stylistic freedom with recognizability in creative AI applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CraftGraffiti, an end-to-end text-guided framework for generating custom graffiti art from an input face image and descriptive prompts. It adopts a 'style-first, identity-after' pipeline: a LoRA-fine-tuned pretrained diffusion transformer performs graffiti style transfer, and a face-consistent self-attention mechanism then augments attention layers with explicit identity embeddings to enforce facial fidelity. Pose customization is handled via CLIP-guided prompt extension without keypoints. The authors state that they formally justify and empirically validate that this ordering reduces attribute drift relative to the reverse sequence. They report competitive facial feature consistency together with state-of-the-art aesthetic and human-preference scores, supported by quantitative tables, qualitative examples, and a live deployment at the Cruilla Festival.

Significance. If the empirical claims are substantiated, the paper would make a meaningful contribution to identity-preserving generative models for highly abstract artistic domains. The explicit validation of the style-first ordering, the real-world festival deployment, and the focus on recognizability in graffiti together address a practical gap between stylistic freedom and cultural/personal authenticity in creative AI applications.

major comments (2)
  1. [Quantitative Evaluation] Quantitative Evaluation section: the reported facial-feature consistency scores rely on standard identity embeddings (ArcFace or equivalent). These embeddings are trained on photorealistic data; the paper does not demonstrate that the same embeddings remain reliable under extreme high-contrast graffiti stylization where eyes, nose, and mouth can be heavily abstracted. Without a domain-specific validation (e.g., human study on identity recognition in the generated graffiti or an alternative metric), the central claim that the style-first paradigm reduces attribute drift cannot be considered fully supported.
  2. [Framework and Ablation] Framework and Ablation sections: the face-consistent self-attention is described as the safeguard against subtle distortions, yet the manuscript provides no targeted ablation isolating its contribution specifically on identity-critical regions (eyes/nose/mouth) under the most extreme stylization prompts. This leaves the mechanistic justification for the paradigm partially unproven.
minor comments (2)
  1. [Abstract] The abstract states that the paradigm is 'formally justified'; if this justification appears in §3 or §4, a one-sentence pointer in the abstract would improve readability.
  2. [Tables] Tables reporting quantitative scores should include standard deviations or confidence intervals and the exact number of human raters for the preference study.
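The uncertainty reporting requested in the second minor comment could look like the sketch below, which computes a mean, a sample standard deviation, and a percentile bootstrap confidence interval. The per-image scores are made up for illustration, and the resample count and seed are arbitrary choices, not values from the paper.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(scores, k=n))
                   for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-image scores (not from the paper).
scores = [0.61, 0.58, 0.66, 0.70, 0.55, 0.63, 0.68, 0.59]
mean = statistics.fmean(scores)
std = statistics.stdev(scores)          # sample standard deviation
low, high = bootstrap_ci(scores)
print(f"mean={mean:.3f} std={std:.3f} 95% CI=({low:.3f}, {high:.3f})")
```

Reporting the interval next to each table entry, along with the number of raters, would let readers judge whether differences between methods exceed sampling noise.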

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which raises valid points about metric reliability and ablation depth in stylized domains. We address each major comment below and will revise the manuscript accordingly to provide stronger empirical support.

  1. Referee: [Quantitative Evaluation] Quantitative Evaluation section: the reported facial-feature consistency scores rely on standard identity embeddings (ArcFace or equivalent). These embeddings are trained on photorealistic data; the paper does not demonstrate that the same embeddings remain reliable under extreme high-contrast graffiti stylization where eyes, nose, and mouth can be heavily abstracted. Without a domain-specific validation (e.g., human study on identity recognition in the generated graffiti or an alternative metric), the central claim that the style-first paradigm reduces attribute drift cannot be considered fully supported.

    Authors: We acknowledge that ArcFace embeddings are trained on photorealistic data and may have reduced reliability for heavily abstracted graffiti styles. While our existing human preference scores provide indirect support for recognizability, we agree this does not fully substitute for domain-specific validation. In the revised manuscript, we will add a targeted human study in which participants match generated graffiti images to original face photos and rate identity similarity. Results will be reported alongside the existing metrics in the Quantitative Evaluation section to directly support the claim that the style-first ordering reduces attribute drift. revision: yes

  2. Referee: [Framework and Ablation] Framework and Ablation sections: the face-consistent self-attention is described as the safeguard against subtle distortions, yet the manuscript provides no targeted ablation isolating its contribution specifically on identity-critical regions (eyes/nose/mouth) under the most extreme stylization prompts. This leaves the mechanistic justification for the paradigm partially unproven.

    Authors: The face-consistent self-attention augments attention layers with identity embeddings precisely to protect critical facial regions. To strengthen the mechanistic evidence, we will add a focused ablation study in the revised manuscript. This will compare outputs with and without the self-attention module under extreme stylization prompts, with quantitative evaluation of distortions localized to the eyes, nose, and mouth regions (using region-specific feature similarity) as well as qualitative examples. These results will be included in the Ablation section. revision: yes
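The "region-specific feature similarity" proposed in this response is not specified further. One plausible reading is sketched below: per-region cosine similarity between source and generated features, compared with and without the module. The feature vectors, region names, and helper functions are hypothetical, and a real study would obtain the features from a face encoder applied to cropped regions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def region_similarity(source_feats, generated_feats):
    """Cosine similarity per facial region, for regions present in both inputs."""
    return {r: cosine_similarity(source_feats[r], generated_feats[r])
            for r in source_feats if r in generated_feats}

def ablation_gap(source, with_module, without_module):
    """Per-region similarity gain attributable to the module (positive = helps)."""
    a = region_similarity(source, with_module)
    b = region_similarity(source, without_module)
    return {r: a[r] - b[r] for r in a if r in b}

# Toy 2-D features standing in for encoder outputs (made up).
source = {"eyes": [1.0, 0.0], "nose": [0.0, 1.0], "mouth": [1.0, 1.0]}
with_module = {"eyes": [0.9, 0.1], "nose": [0.1, 0.9], "mouth": [1.0, 0.9]}
without_module = {"eyes": [0.5, 0.5], "nose": [0.6, 0.4], "mouth": [1.0, 0.2]}
gap = ablation_gap(source, with_module, without_module)
```

Averaging such gaps over many subjects and over the most extreme stylization prompts would give the localized evidence the referee asks for.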

Circularity Check

0 steps flagged

No circularity: framework relies on external pretrained components and empirical validation


The paper presents CraftGraffiti as an end-to-end framework that applies LoRA-fine-tuned pretrained diffusion transformers for style transfer followed by a face-consistent self-attention mechanism using explicit identity embeddings. The 'style-first, identity-after' paradigm is described as formally justified and empirically validated against the reverse order, with quantitative results on facial consistency, aesthetics, and human preference. No equations, fitted parameters, or self-referential definitions are shown that reduce any claimed prediction or improvement to quantities defined inside the paper itself. All core modules reference external pretrained models (diffusion transformers, CLIP) rather than internal fits or self-citations that would create a closed loop. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that pretrained diffusion models plus added identity embeddings will behave as described without further derivation.

pith-pipeline@v0.9.0 · 5761 in / 1082 out tokens · 42621 ms · 2026-05-18T20:41:50.970887+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DocRevive: A Unified Pipeline for Document Text Restoration

    cs.CV 2026-04 unverdicted novelty 5.0

    DocRevive builds a unified pipeline using OCR, image analysis, language models, and diffusion to reconstruct degraded document text, backed by a 30k-image synthetic dataset and the UCSM metric.

  2. DocRevive: A Unified Pipeline for Document Text Restoration

    cs.CV 2026-04 unverdicted novelty 5.0

    A unified pipeline using OCR, inpainting, and diffusion models restores text in degraded documents on a new synthetic benchmark dataset, evaluated with the proposed UCSM metric.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Living lab

    Living lab. https://en.wikipedia.org/wiki/Living_lab, 2025. Accessed: 2025-08-09

  2. [2]

    Ethical challenges and solutions of generative ai: An interdisciplinary perspective

    Mousa Al-kfairy, Dheya Mustafa, Nir Kshetri, Mazen Insiew, and Omar Alfandi. Ethical challenges and solutions of generative ai: An interdisciplinary perspective. Informatics, 11(3):58,

  3. [3]

    Svgcraft: Beyond single object text-to-svg synthesis with comprehensive canvas layout, 2025

    Ayan Banerjee, Nityanand Mathur, Josep Lladós, Umapada Pal, and Anjan Dutta. Svgcraft: Beyond single object text-to-svg synthesis with comprehensive canvas layout, 2025

  4. [4]

    Bombing, tagging, writing: An analysis of the significance of graffiti and street art

    Lindsay Bates. Bombing, tagging, writing: An analysis of the significance of graffiti and street art. PhD thesis, University of Pennsylvania, 2014

  5. [5]

    Fairness in machine learning: Lessons from political philosophy

    Reuben Binns. Fairness in machine learning: Lessons from political philosophy. Proceedings of the 2017 FMML Workshop on Fair ML, 2017. arXiv preprint arXiv:1712.03586

  6. [6]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  7. [7]

    Vggface2: A dataset for recognising faces across pose and age

    Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE, 2018

  8. [8]

    Upgpt: Universal diffusion model for person image generation, editing and pose transfer

    Soon Yau Cheong, Armin Mustafa, and Andrew Gilbert. Upgpt: Universal diffusion model for person image generation, editing and pose transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4173–4182, 2023

  9. [9]

    Living labs and user engagement for innovation and sustainability

    Luca Compagnucci, Francesca Spigarelli, Jorge Coelho, and Carlos Duarte. Living labs and user engagement for innovation and sustainability. Journal of Cleaner Production, 317:128223, 2021

  10. [10]

    Power of graffiti: Exploring its cultural and social significance

    Saday Chandra Das. Power of graffiti: Exploring its cultural and social significance. Aayushi International Interdisciplinary Research Journal (AIIRJ), X (IX), pages 34–35, 2023

  11. [11]

    Prompt tuning inversion for text-driven image editing using diffusion models

    Wenkai Dong, Song Xue, Xiaoyue Duan, and Shumin Han. Prompt tuning inversion for text-driven image editing using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7430–7440, 2023

  12. [12]

    Diffusion in style

    Martin Nicolas Everaert, Marco Bocchio, Sami Arpa, Sabine Süsstrunk, and Radhakrishna Achanta. Diffusion in style. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2251–2261, 2023

  13. [13]

    Fairness and bias in artificial intelligence: A survey

    Emilio Ferrara. Fairness and bias in artificial intelligence: A survey. Digital, 6(1):1–41, 2023

  14. [14]

    Evaluating the cultural significance of historic graffiti

    Alan M Forster, Samantha Vettese-Forster, and John Borland. Evaluating the cultural significance of historic graffiti. Structural Survey, 30(1):43–64, 2012

  15. [15]

    i don’t see myself represented here at all

    Sourojit Ghosh, Nina Lutz, and Aylin Caliskan. “i don’t see myself represented here at all”: User experiences of stable diffusion outputs containing representational harms across gender identities and nationalities. In Proceedings of the AAAI/ACM conference on AI, ethics, and society, volume 7, pages 463–475, 2024

  16. [16]

    Grok 3 beta—the age of reasoning agents

    xAI. Grok 3 beta—the age of reasoning agents

  17. [17]

    Focus on your instruction: Fine-grained and multi-instruction image editing by attention modulation

    Qin Guo and Tianwei Lin. Focus on your instruction: Fine-grained and multi-instruction image editing by attention modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6986–6996, 2024

  18. [18]

    Diffusion-enhanced patchmatch: A framework for arbitrary style transfer with diffusion models

    Mark Hamazaspyan and Shant Navasardyan. Diffusion-enhanced patchmatch: A framework for arbitrary style transfer with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 797–805, 2023

  19. [19]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  20. [20]

    Diffstyler: Controllable dual diffusion for text-driven image stylization

    Nisha Huang, Yuxin Zhang, Fan Tang, Chongyang Ma, Haibin Huang, Weiming Dong, and Changsheng Xu. Diffstyler: Controllable dual diffusion for text-driven image stylization. IEEE Transactions on Neural Networks and Learning Systems, 2024

  21. [21]

    Diffusion model-based image editing: A survey

    Yi Huang, Jiancheng Huang, Yifan Liu, Mingfu Yan, Jiaxi Lv, Jianzhuang Liu, Wei Xiong, He Zhang, Liangliang Cao, and Shifeng Chen. Diffusion model-based image editing: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  22. [22]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  23. [23]

    Humansd: A native skeleton-guided diffusion model for human image generation

    Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, and Qiang Xu. Humansd: A native skeleton-guided diffusion model for human image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15988–15998, 2023

  24. [24]

    Imagic: Text-based real image editing with diffusion models

    Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6007–6017, 2023

  25. [25]

    Reposedm: Recurrent pose alignment and gradient guidance for pose guided image synthesis

    Anant Khandelwal. Reposedm: Recurrent pose alignment and gradient guidance for pose guided image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2495–2504, 2024

  26. [26]

    Ecoval: Ecological validity of cues and representative design in user experience evaluations

    Suzanne Kieffer. Ecoval: Ecological validity of cues and representative design in user experience evaluations. AIS Transactions on Human-Computer Interaction, 9(2):149–172, 2017

  27. [27]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025

  28. [28]

    Universal style transfer via feature transforms

    Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. Advances in neural information processing systems, 30, 2017

  29. [29]

    Global and local consistent age generative adversarial network (glca-gan)

    Zhen Li, Ping Wang, Qiong Hu, and Ran He. Global and local consistent age generative adversarial network (glca-gan). In Proceedings of the 26th ACM International Conference on Multimedia, pages 305–313, 2018

  30. [30]

    Null-text inversion for editing real images using guided diffusion models

    Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023

  31. [31]

    Ava: A large-scale database for aesthetic visual analysis

    Naila Murray, Luca Marchesotti, and Florent Perronnin. Ava: A large-scale database for aesthetic visual analysis. In 2012 IEEE conference on computer vision and pattern recognition, pages 2408–2415. IEEE, 2012

  32. [32]

    Uncovering bias in face generation models

    Cristian Muñoz, Nicola Zannone, Mohamed Mohammed, and Adriano Koshiyama. Uncovering bias in face generation models. arXiv preprint arXiv:2302.11562, 2023

  33. [33]

    Contrastive denoising score for text-guided latent diffusion image editing

    Hyelin Nam, Gihyun Kwon, Geon Yeong Park, and Jong Chul Ye. Contrastive denoising score for text-guided latent diffusion image editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9192–9201, 2024

  34. [34]

    Do transformer modifications transfer across implementations and applications?

    Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. Do transformer modifications transfer across implementations and applications? arXiv preprint arXiv:2102.11972, 2021

  35. [35]

    Diffbody: Diffusion-based pose and shape editing of human images

    Yuta Okuyama, Yuki Endo, and Yoshihiro Kanamori. Diffbody: Diffusion-based pose and shape editing of human images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6333–6342, 2024

  36. [36]

    K-lora: Unlocking training-free fusion of any subject and style loras

    Ziheng Ouyang, Zhen Li, and Qibin Hou. K-lora: Unlocking training-free fusion of any subject and style loras. arXiv preprint arXiv:2502.18461, 2025

  37. [37]

    Enhancing dreambooth with lora for generating unlimited characters with stable diffusion

    Rubén Pascual, Adrián Maiza, Mikel Sesma-Sara, Daniel Paternain, and Mikel Galar. Enhancing dreambooth with lora for generating unlimited characters with stable diffusion. In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2024

  38. [38]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  39. [39]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  40. [40]

    Crowd sensing and living lab outdoor experimentation made easy

    Evangelos Pournaras, Atif Nabi Ghulam, Renato Kunz, and Regula Hänggli. Crowd sensing and living lab outdoor experimentation made easy. arXiv preprint arXiv:2107.04117, 2021

  41. [41]

    Training-free identity preservation in stylized image generation using diffusion models

    Mohammad Ali Rezaei, Helia Hajikazem, Saeed Khanehgir, and Mahdi Javanmardi. Training-free identity preservation in stylized image generation using diffusion models. arXiv preprint arXiv:2506.06802, 2025

  42. [42]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  43. [43]

    Facenet: A unified embedding for face recognition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015

  44. [44]

    Smiling women pitching down: Auditing gender bias in image generative ai

    Chien Sun, William Tzeng, et al. Smiling women pitching down: Auditing gender bias in image generative ai. arXiv preprint arXiv:2305.10566, 2023

  45. [45]

    Training-free consistent text-to-image generation

    Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consistent text-to-image generation. ACM Transactions on Graphics (TOG), 43(4):1–18, 2024

  46. [46]

    Instantstyle-plus: Style transfer with content-preserving in text-to-image generation

    Haofan Wang, Peng Xing, Renyuan Huang, Hao Ai, Qixun Wang, and Xu Bai. Instantstyle-plus: Style transfer with content-preserving in text-to-image generation. arXiv preprint arXiv:2407.00788, 2024

  47. [47]

    Stable-pose: Leveraging transformers for pose-guided text-to-image generation

    Jiajun Wang, Morteza Ghahremani Boozandani, Yitong Li, Björn Ommer, and Christian Wachinger. Stable-pose: Leveraging transformers for pose-guided text-to-image generation. Advances in Neural Information Processing Systems, 37:65670–65698, 2024

  48. [48]

    Interactive image style transfer guided by graffiti

    Quan Wang, Yanli Ren, Xinpeng Zhang, and Guorui Feng. Interactive image style transfer guided by graffiti. In Proceedings of the 31st ACM International Conference on Multimedia, pages 6685–6694, 2023

  49. [49]

    Autostory: Generating diverse storytelling images with minimal human efforts

    Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, and Chunhua Shen. Autostory: Generating diverse storytelling images with minimal human efforts. International Journal of Computer Vision, pages 1–22, 2024

  50. [50]

    Stylediffusion: Controllable disentangled style transfer via diffusion models

    Zhizhong Wang, Lei Zhao, and Wei Xing. Stylediffusion: Controllable disentangled style transfer via diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7677–7689, 2023

  51. [51]

    A latent space of stochastic diffusion models for zero-shot image editing and guidance

    Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7378–7387, 2023

  52. [52]

    Human preference score: Better aligning text-to-image models with human preference

    Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096–2105, 2023

  53. [53]

    Drb-gan: A dynamic resblock generative adversarial network for artistic style transfer

    Wenju Xu, Chengjiang Long, Ruisheng Wang, and Guanghui Wang. Drb-gan: A dynamic resblock generative adversarial network for artistic style transfer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6383–6392, 2021

  54. [54]

    Controllable artistic text style transfer via shape-matching gan

    Shuai Yang, Zhangyang Wang, Zhaowen Wang, Ning Xu, Jiaying Liu, and Zongming Guo. Controllable artistic text style transfer via shape-matching gan. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4442–4451, 2019

  55. [55]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

  56. [56]

    Inversion-based style transfer with diffusion models

    Yuxin Zhang, Nisha Huang, Fan Tang, Haibin Huang, Chongyang Ma, Weiming Dong, and Changsheng Xu. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10146–10156, 2023

  57. [57]

    Sine: Single image editing with text-to-image diffusion models

    Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris N Metaxas, and Jian Ren. Sine: Single image editing with text-to-image diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6027–6037, 2023

  58. [58]

    Bias in generative ai

    Xiang Zhou. Bias in generative ai. arXiv preprint arXiv:2403.02726, 2024

  59. [59]

    Storydiffusion: Consistent self-attention for long-range image and video generation

    Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation. Advances in Neural Information Processing Systems, 37:110315–110340, 2024

A Demonstration at Cruïlla Festival

An installation (see Fig. 8) implementing the proposed system was deployed during the F...