
arxiv: 2605.19698 · v1 · pith:E4XPP4ZZnew · submitted 2026-05-19 · 💻 cs.CR · cs.LG

Awakening the Hydra: Stabilizing Multi-Concept Backdoor Injection in Text-to-Image Diffusion Models

Pith reviewed 2026-05-20 04:35 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords backdoor attacks · text-to-image diffusion models · multi-concept injection · trigger search · model reuse · fine-tuning stability

The pith

Hydra stabilizes multiple backdoors in text-to-image diffusion models by finding triggers that resist semantic interference during repeated reuse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines backdoor attacks on diffusion models that get reused and fine-tuned by many independent parties, allowing multiple concept-specific trigger-target pairs to build up in one checkpoint. Without special handling, these pairs interfere in the shared representation space, lowering attack reliability and image quality. Hydra addresses this by searching for stable triggers through evolution in the text encoder space and by using multi-task fine-tuning plus trigger-clean regularization to coordinate the injections. Experiments across diffusion backbones show the method keeps attack success near 95 percent even with 500 concept pairs from 8 attackers while leaving clean generation intact. If correct, this reveals how backdoors can accumulate reliably in open ecosystems rather than canceling each other out.

Core claim

Hydra maintains effective backdoor activation while preserving clean generation fidelity and image quality. It does so in two ways. First, it performs evolutionary trigger search in the text encoder space to identify triggers that are semantically aligned with their target concepts yet remain stable across other injected concepts. Second, it combines multi-task fine-tuning with trigger-clean regularization to improve training stability under dense multi-concept injection.

What carries the argument

Evolutionary trigger search in the text encoder space plus multi-task fine-tuning with trigger-clean regularization, which together constrain trigger semantics and coordinate cross-task interactions.
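To make the search component concrete, here is a minimal sketch of an evolutionary search over candidate trigger words. It is not the paper's algorithm: the fitness form (cosine alignment with the target minus a worst-case similarity penalty against other injected concepts), the elitist selection scheme, the helper names `fitness` and `evolve_trigger`, and the toy three-dimensional "embeddings" are all assumptions chosen for illustration. A real run would score candidates with the diffusion model's text encoder.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fitness(trigger_vec, target_vec, other_vecs, penalty=1.0):
    """Assumed fitness: alignment with the target minus an interference penalty."""
    alignment = cosine(trigger_vec, target_vec)
    # Worst-case similarity to any other injected concept, floored at zero.
    interference = max([cosine(trigger_vec, o) for o in other_vecs] + [0.0])
    return alignment - penalty * interference


def evolve_trigger(embed, vocab, target_vec, other_vecs,
                   pop_size=4, generations=10, seed=0):
    """Toy elitist search over candidate trigger words (hypothetical helper)."""
    rng = random.Random(seed)

    def score(word):
        return fitness(embed[word], target_vec, other_vecs)

    population = rng.sample(vocab, min(pop_size, len(vocab)))
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[: max(1, len(population) // 2)]
        # "Mutation" here is just a fresh random draw from the vocabulary.
        children = [rng.choice(vocab)
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children
    return max(population, key=score)


# Made-up 3-d vectors standing in for text-encoder embeddings.
toy_embed = {
    "zephyr": [1.0, 0.1, 0.0],
    "quokka": [0.0, 1.0, 0.0],
    "obelus": [0.7, 0.7, 0.0],
    "fjord": [0.0, 0.0, 1.0],
}
toy_target = [1.0, 0.0, 0.0]   # embedding of the target concept
toy_others = [[0.0, 1.0, 0.0]]  # embeddings of other injected concepts
best = evolve_trigger(toy_embed, list(toy_embed), toy_target, toy_others)
```

In this toy setup the word closest to the target and farthest from the competing concept wins. The penalty weight controls how strongly the search trades alignment for stability, which is the trade-off the pith attributes to Hydra.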

If this is right

  • Multiple concept-specific backdoors can coexist in one model without cross-concept entanglement reducing attack success.
  • Attack success rate remains near 95 percent across 8 attackers and 500 concept pairs under cumulative decentralized reuse.
  • Clean generation fidelity and overall image quality stay high despite dense multi-concept injection.
  • Trigger semantics can be explicitly constrained to prevent destabilization that normally occurs with accumulating backdoors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Verification tools for reused checkpoints may need to scan for clusters of stable triggers rather than isolated ones.
  • The same coordination approach could be tested on other generative architectures that support sequential fine-tuning.
  • If attackers adopt this method widely, the risk profile of open model sharing increases because backdoors become more persistent.

Load-bearing premise

Evolutionary search in the text encoder space can reliably locate triggers that stay semantically aligned with their targets and stable when many other concepts are also injected.

What would settle it

Rerun the 500-concept-pair experiment on additional diffusion backbones, or scale it to 1000 pairs, and check whether attack success rate falls below 90 percent or clean image metrics degrade sharply.

Figures

Figures reproduced from arXiv: 2605.19698 by Chengcheng Zhu, Chuang Ma, Jiale Zhang, Kai Wang, Songze Li.

Figure 1: Visual distortion and attention diffusion in common backdoor methods
Figure 2: Analysis of the implosion mechanism. (a) Concept-level mapping ...
Figure 3: Framework of the proposed Hydra. Left: Evolutionary Trigger Search discovers rare, semantics-aware triggers under distribution-preserving and ...
Figure 4: Qualitative comparison under clean and triggered prompts. Left: "a photo of sunglass". Right: "a photo of ...
Figure 5: Generalization and robustness analysis of Hydra under varying prompt complexity, dataset source, and poisoning scale. The curves report ASR and ...
Figure 6: Retention of the first-injected backdoor under sequential heterogeneous attacks. Each column fixes a different first-injected method, and S1–S6 denote ...
Figure 8: Impact of downstream fine-tuning strategies on backdoor robustness.
Figure 10: Visualization of the denoising process for benign and backdoored diffusion models. Backdoor activation gradually emerges over denoising steps only ...
Figure 11: Sensitivity analysis of ASR and ACC with respect to key hyperpa...
Figure 12: Trigger-induced representation shift and cross-modal propagation
Figure 13: Prompt template for filtering visually grounded nominal concepts
Figure 14: Generalization performance on ImageNet under simple and complex ...
Figure 15: Impact of single-backdoor retention under sequential injections.
Figure 16: Additional qualitative results under complex and compositional prompts.
Abstract

Text-to-image diffusion models are increasingly developed through open-source reuse and repeated downstream fine-tuning, where reused checkpoints are difficult to verify and thus more susceptible to hidden backdoor behaviors. In such ecosystems, a single pretrained model may be sequentially adapted and redistributed by multiple independent parties, allowing multiple concept-specific trigger-target associations to accumulate in the same model. When these associations coexist, semantic conflicts can be amplified in the shared representation space, leading to cross-concept entanglement and degraded generation quality. Notably, instead of strengthening the attack, such accumulation can destabilize previously injected behaviors and reduce attack reliability. In this work, we systematically investigate backdoor attacks under this interference-prone setting and propose Hydra, a unified framework for robust and controlled multi-concept backdoor injection under cumulative and decentralized reuse. Our core insight is that stable backdoor injection under large-scale multi-concept settings requires explicitly constraining trigger semantics while coordinating cross-task interactions during optimization. Specifically, Hydra performs evolutionary trigger search in the text encoder space to identify triggers that are semantically aligned with their target concepts while remaining stable across other injected concepts. It further combines multi-task fine-tuning with trigger-clean regularization to improve training stability under dense multi-concept injection. Extensive experiments across multiple diffusion backbones under rigorous multi-concept settings show that Hydra maintains effective backdoor activation while preserving clean generation fidelity and image quality. For instance, across 8 attackers and 500 concept pairs, Hydra maintains ~95% ASR and strong clean generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Hydra, a framework for stable multi-concept backdoor injection into text-to-image diffusion models under cumulative decentralized reuse. It performs evolutionary trigger search in text-encoder space to identify semantically aligned triggers that remain stable across concepts, then applies multi-task fine-tuning plus trigger-clean regularization to mitigate interference. Experiments across multiple backbones report that Hydra sustains effective backdoor activation (~95% ASR) while preserving clean generation fidelity and image quality for 8 attackers and 500 concept pairs.

Significance. If the stability results hold under full verification, the work is significant for AI security research. It directly addresses a realistic threat model in open-source diffusion ecosystems where models accumulate backdoors through repeated independent fine-tuning. The combination of evolutionary search and cross-task regularization offers a concrete technical approach, and the scale of the multi-concept experiments (500 pairs) provides useful empirical grounding for both attack and defense studies.

major comments (2)
  1. [§3.2] §3.2 (Evolutionary Trigger Search): the description states that the search identifies triggers 'semantically aligned with their target concepts while remaining stable across other injected concepts,' yet supplies no fitness function, population size, generation count, or explicit interference penalty. Because the central claim of cross-concept stability under sequential decentralized injection rests on this optimization step, the missing details prevent assessment of whether the procedure actually enforces the required invariance.
  2. [§4] §4 (Multi-Concept Experiments): the reported ~95% ASR and 'strong clean generation' across 500 concept pairs are not accompanied by the precise ASR definition, the baselines against which it is compared, or the criteria used to select or exclude concept pairs. These omissions are load-bearing for the claim that Hydra avoids entanglement and maintains fidelity under dense multi-concept injection.
minor comments (2)
  1. [Abstract / §4] The abstract and §4 use 'ASR' without an explicit expansion on first use in the main body; add the definition for clarity.
  2. [Figure 3] Figure 3 (framework overview) would benefit from explicit arrows or labels indicating the flow from evolutionary search to the multi-task fine-tuning stage.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address each major comment below and have revised the manuscript accordingly to improve clarity and completeness.

  1. Referee: [§3.2] §3.2 (Evolutionary Trigger Search): the description states that the search identifies triggers 'semantically aligned with their target concepts while remaining stable across other injected concepts,' yet supplies no fitness function, population size, generation count, or explicit interference penalty. Because the central claim of cross-concept stability under sequential decentralized injection rests on this optimization step, the missing details prevent assessment of whether the procedure actually enforces the required invariance.

    Authors: We appreciate this observation. The original manuscript provided a high-level overview of the evolutionary trigger search to emphasize the framework's novelty. To enable full evaluation and reproducibility, we will expand Section 3.2 with the complete details of the optimization procedure. This includes the fitness function, which balances semantic alignment to the target concept (via embedding similarity) with a stability term that penalizes interference with other concepts, along with the population size, number of generations, and the explicit interference penalty term. These additions will be included in the revised manuscript.

    Revision: yes

  2. Referee: [§4] §4 (Multi-Concept Experiments): the reported ~95% ASR and 'strong clean generation' across 500 concept pairs are not accompanied by the precise ASR definition, the baselines against which it is compared, or the criteria used to select or exclude concept pairs. These omissions are load-bearing for the claim that Hydra avoids entanglement and maintains fidelity under dense multi-concept injection.

    Authors: We agree that these details are essential for interpreting the results. In the revised version, we will provide a precise definition of ASR as the proportion of test prompts containing the trigger that successfully generate the target concept. We will also specify the baselines used, such as independent single-concept injections and multi-concept fine-tuning without the proposed regularization. Additionally, we will detail the concept pair selection process, which involved sampling from a diverse set of concepts while avoiding highly similar pairs to minimize natural semantic conflicts. These clarifications will be added to Section 4.

    Revision: yes
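The ASR definition offered in the second response (the proportion of triggered test prompts that produce the target concept) can be written down directly. The sketch below assumes each triggered prompt has already been judged as a hit or a miss. How that judgment is made (for example, by a classifier or a human rater) is not stated on this page, so it is left outside the code. The function names and the per-pair grouping are illustrative choices, not the paper's evaluation protocol.

```python
def attack_success_rate(outcomes):
    """Fraction of triggered prompts whose image shows the target concept.

    outcomes: iterable of booleans, one per triggered test prompt.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one triggered prompt")
    return sum(1 for hit in outcomes if hit) / len(outcomes)


def per_pair_asr(records):
    """Group (concept_pair_id, hit) records and compute ASR for each pair."""
    grouped = {}
    for pair_id, hit in records:
        grouped.setdefault(pair_id, []).append(hit)
    return {pair_id: attack_success_rate(hits)
            for pair_id, hits in grouped.items()}


# Hypothetical judgments for two concept pairs.
records = [("dog->cat", True), ("dog->cat", True),
           ("car->bus", True), ("car->bus", False)]
overall = attack_success_rate(hit for _, hit in records)
by_pair = per_pair_asr(records)
```

Reporting the per-pair breakdown alongside the overall figure matters in the multi-concept setting, because an aggregate near 95% could hide a handful of pairs whose backdoors have collapsed.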

Circularity Check

0 steps flagged

No circularity: empirical method with independent experimental validation


The paper proposes Hydra as a framework combining evolutionary trigger search in text-encoder space with multi-task fine-tuning and trigger-clean regularization. Claims of ~95% ASR and preserved clean fidelity across 8 attackers and 500 concept pairs are presented as outcomes of reported experiments rather than reductions of any equation to its own fitted inputs or self-citations. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing uniqueness theorems appear in the provided text; the stability assertions rest on external empirical measurements instead of internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about semantic stability in text encoder space and the ability of regularization to coordinate cross-task interactions; no explicit free parameters or invented entities are detailed in the abstract.

axioms (2)
  • domain assumption: Triggers can be found in text encoder space that align with target concepts yet remain stable across multiple injected concepts.
    This is the core premise enabling the evolutionary search component.
  • domain assumption: Multi-task fine-tuning combined with trigger-clean regularization improves stability under dense injection.
    Invoked to justify the training procedure for preventing entanglement.
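The second assumption can be illustrated with a toy objective. The page does not give the paper's loss, so everything here is an assumed form: the multi-task term is taken to be the mean of per-concept backdoor losses, and "trigger-clean regularization" is interpreted as a penalty on how far the fine-tuned model's outputs on clean prompts drift from a frozen reference model. The names `clean_drift` and `combined_loss` and the weight `reg_weight` are hypothetical.

```python
def mean_squared_error(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def clean_drift(tuned_outputs, reference_outputs):
    """Assumed regularizer: average drift on clean prompts vs. a frozen model."""
    pairs = list(zip(tuned_outputs, reference_outputs))
    return sum(mean_squared_error(t, r) for t, r in pairs) / len(pairs)


def combined_loss(backdoor_losses, drift, reg_weight=0.5):
    """Assumed objective: mean per-concept backdoor loss plus weighted drift."""
    if not backdoor_losses:
        raise ValueError("need at least one backdoor task")
    multi_task = sum(backdoor_losses) / len(backdoor_losses)
    return multi_task + reg_weight * drift


# Hypothetical numbers: two backdoor tasks and two clean prompts.
drift = clean_drift([[1.0, 2.0], [0.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]])
total = combined_loss([0.2, 0.4], drift, reg_weight=0.5)
```

Averaging the task losses keeps any single concept pair from dominating the update, while the drift term ties clean behavior to the original model. Whether Hydra uses this form or a different one is not established by this page.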



Algorithm 1: Evolutionary Trigger Search. Input: assigned concept pairs C+, negative concept pairs C−, rare-word vocabulary V, population size P, maximum generat...