Awakening the Hydra: Stabilizing Multi-Concept Backdoor Injection in Text-to-Image Diffusion Models

Chengcheng Zhu; Chuang Ma; Jiale Zhang; Kai Wang; Songze Li

arxiv: 2605.19698 · v1 · pith:E4XPP4ZZnew · submitted 2026-05-19 · 💻 cs.CR · cs.LG

Pith reviewed 2026-05-20 04:35 UTC · model grok-4.3

keywords: backdoor attacks · text-to-image diffusion models · multi-concept injection · trigger search · model reuse · fine-tuning stability

The pith

Hydra stabilizes multiple backdoors in text-to-image diffusion models by finding triggers that resist semantic interference during repeated reuse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines backdoor attacks on diffusion models that get reused and fine-tuned by many independent parties, allowing multiple concept-specific trigger-target pairs to build up in one checkpoint. Without special handling, these pairs interfere in the shared representation space, lowering attack reliability and image quality. Hydra addresses this by searching for stable triggers through evolution in the text encoder space and by using multi-task fine-tuning plus trigger-clean regularization to coordinate the injections. Experiments across diffusion backbones show the method keeps attack success near 95 percent even with 500 concept pairs from 8 attackers while leaving clean generation intact. If correct, this reveals how backdoors can accumulate reliably in open ecosystems rather than canceling each other out.

Core claim

Hydra maintains effective backdoor activation while preserving clean generation fidelity and image quality by performing evolutionary trigger search in the text encoder space to identify triggers that are semantically aligned with their target concepts while remaining stable across other injected concepts, and by combining multi-task fine-tuning with trigger-clean regularization to improve training stability under dense multi-concept injection.

What carries the argument

Evolutionary trigger search in the text encoder space plus multi-task fine-tuning with trigger-clean regularization, which together constrain trigger semantics and coordinate cross-task interactions.
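The search idea can be illustrated with a toy sketch. The paper's fitness function, population size, and generation count are not given on this page, so everything below is an assumption: candidates are plain vectors standing in for text-encoder embeddings, `fitness` rewards cosine similarity to the candidate's own target and subtracts a penalty for similarity to other concepts (the shape the simulated rebuttal further down describes), and `evolve`, `penalty`, `sigma`, and the default sizes are hypothetical names and values chosen for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def fitness(candidate, target, others, penalty=0.5):
    """Alignment with the target minus a penalty for the closest other concept.

    Assumed form only: the source does not specify the real fitness function.
    """
    alignment = cosine(candidate, target)
    interference = max((cosine(candidate, o) for o in others), default=0.0)
    return alignment - penalty * max(interference, 0.0)


def evolve(target, others, pop_size=32, generations=40, sigma=0.3, seed=0):
    """Elitist evolutionary search over toy embedding vectors."""
    rng = random.Random(seed)
    dim = len(target)
    population = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(pop_size)]
    n_elite = max(1, pop_size // 4)
    for _ in range(generations):
        ranked = sorted(
            population, key=lambda c: fitness(c, target, others), reverse=True
        )
        elite = ranked[:n_elite]  # survivors are carried over unchanged
        children = [
            [x + rng.gauss(0, sigma) for x in rng.choice(elite)]  # Gaussian mutation
            for _ in range(pop_size - n_elite)
        ]
        population = elite + children
    best = max(population, key=lambda c: fitness(c, target, others))
    return best, fitness(best, target, others)
```

Because the elite are copied forward unchanged, the best score never decreases between generations. The sketch works on synthetic vectors only and says nothing about how the authors map candidates back to tokens or which encoder they use.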

If this is right

Multiple concept-specific backdoors can coexist in one model without cross-concept entanglement reducing attack success.
Attack success rate remains near 95 percent across 8 attackers and 500 concept pairs under cumulative decentralized reuse.
Clean generation fidelity and overall image quality stay high despite dense multi-concept injection.
Trigger semantics can be explicitly constrained to prevent destabilization that normally occurs with accumulating backdoors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Verification tools for reused checkpoints may need to scan for clusters of stable triggers rather than isolated ones.
The same coordination approach could be tested on other generative architectures that support sequential fine-tuning.
If attackers adopt this method widely, the risk profile of open model sharing increases because backdoors become more persistent.

Load-bearing premise

Evolutionary search in the text encoder space can reliably locate triggers that stay semantically aligned with their targets and stable when many other concepts are also injected.

What would settle it

Running the same 500-concept-pair experiment on additional diffusion backbones or with 1000 pairs and observing whether attack success rate falls below 90 percent or clean image metrics degrade sharply.

Figures

Figures reproduced from arXiv: 2605.19698 by Chengcheng Zhu, Chuang Ma, Jiale Zhang, Kai Wang, Songze Li.

**Figure 2.** Analysis of the implosion mechanism. (a) Concept-level mapping

**Figure 3.** Framework of the proposed Hydra. Left: Evolutionary Trigger Search discovers rare, semantics-aware triggers under distribution-preserving and

**Figure 4.** Qualitative comparison under clean and triggered prompts. Left: "a photo of sunglass". Right: "a photo of

**Figure 5.** Generalization and robustness analysis of Hydra under varying prompt complexity, dataset source, and poisoning scale. The curves report ASR and

**Figure 6.** Retention of the first-injected backdoor under sequential heterogeneous attacks. Each column fixes a different first-injected method, and S1–S6 denote

**Figure 8.** Impact of downstream fine-tuning strategies on backdoor robustness.

**Figure 10.** Visualization of the denoising process for benign and backdoored diffusion models. Backdoor activation gradually emerges over denoising steps only

**Figure 11.** Sensitivity analysis of ASR and ACC with respect to key hyperpa

**Figure 12.** Trigger-induced representation shift and cross-modal propagation

**Figure 13.** Prompt template for filtering visually grounded nominal concepts

**Figure 14.** Generalization performance on ImageNet under simple and complex

**Figure 15.** Impact of single-backdoor retention under sequential injections.

**Figure 16.** Additional qualitative results under complex and compositional prompts.

Original abstract

Text-to-image diffusion models are increasingly developed through open-source reuse and repeated downstream fine-tuning, where reused checkpoints are difficult to verify and thus more susceptible to hidden backdoor behaviors. In such ecosystems, a single pretrained model may be sequentially adapted and redistributed by multiple independent parties, allowing multiple concept-specific trigger-target associations to accumulate in the same model. When these associations coexist, semantic conflicts can be amplified in the shared representation space, leading to cross-concept entanglement and degraded generation quality. Notably, instead of strengthening the attack, such accumulation can destabilize previously injected behaviors and reduce attack reliability. In this work, we systematically investigate backdoor attacks under this interference-prone setting and propose Hydra, a unified framework for robust and controlled multi-concept backdoor injection under cumulative and decentralized reuse. Our core insight is that stable backdoor injection under large-scale multi-concept settings requires explicitly constraining trigger semantics while coordinating cross-task interactions during optimization. Specifically, Hydra performs evolutionary trigger search in the text encoder space to identify triggers that are semantically aligned with their target concepts while remaining stable across other injected concepts. It further combines multi-task fine-tuning with trigger-clean regularization to improve training stability under dense multi-concept injection. Extensive experiments across multiple diffusion backbones under rigorous multi-concept settings show that Hydra maintains effective backdoor activation while preserving clean generation fidelity and image quality. For instance, across 8 attackers and 500 concept pairs, Hydra maintains ~95% ASR and strong clean generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Hydra gives a workable way to inject multiple stable backdoors into reused diffusion models, but the evolutionary search needs clearer proof that it resists interference from later independent injections.

The main point is that this paper tackles backdoor attacks on text-to-image models when several parties fine-tune and share the same checkpoint one after another. Hydra uses evolutionary search in the text encoder to pick triggers that match their target concepts yet avoid clashing with other injected ones, then adds multi-task fine-tuning and trigger-clean regularization to hold performance steady under dense injection. The reported results look decent on paper: roughly 95 percent attack success rate across 500 concept pairs with eight attackers, while clean image quality stays high. That moves past the single-concept setups in earlier work and matches a realistic open-reuse threat model. The experiments across multiple backbones add some breadth.

The soft spot is the evolutionary search itself. The abstract and stress-test note leave out the fitness function, population size, and any explicit term that penalizes entanglement with concepts that arrive later. Without those details it is hard to judge whether the triggers really stay stable under truly sequential, decentralized additions or whether the optimization was tuned to the specific test pairs. If the full methods section shows ablations on sequential injection order and independent search runs, that would fix the gap; otherwise the stability claim rests on thinner ground than the ASR numbers suggest.

This work is aimed at the AI security crowd that studies generative model misuse and checkpoint sharing. Readers who already follow backdoor papers on diffusion models will get the most from the framework and the multi-concept numbers. It is solid enough to deserve a serious referee, mainly because the threat model is timely and the empirical setup is concrete, even if the search mechanics need tighter documentation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Hydra, a framework for stable multi-concept backdoor injection into text-to-image diffusion models under cumulative decentralized reuse. It performs evolutionary trigger search in text-encoder space to identify semantically aligned triggers that remain stable across concepts, then applies multi-task fine-tuning plus trigger-clean regularization to mitigate interference. Experiments across multiple backbones report that Hydra sustains effective backdoor activation (~95% ASR) while preserving clean generation fidelity and image quality for 8 attackers and 500 concept pairs.

Significance. If the stability results hold under full verification, the work is significant for AI security research. It directly addresses a realistic threat model in open-source diffusion ecosystems where models accumulate backdoors through repeated independent fine-tuning. The combination of evolutionary search and cross-task regularization offers a concrete technical approach, and the scale of the multi-concept experiments (500 pairs) provides useful empirical grounding for both attack and defense studies.

major comments (2)

§3.2 (Evolutionary Trigger Search): the description states that the search identifies triggers 'semantically aligned with their target concepts while remaining stable across other injected concepts,' yet supplies no fitness function, population size, generation count, or explicit interference penalty. Because the central claim of cross-concept stability under sequential decentralized injection rests on this optimization step, the missing details prevent assessment of whether the procedure actually enforces the required invariance.
§4 (Multi-Concept Experiments): the reported ~95% ASR and 'strong clean generation' across 500 concept pairs are not accompanied by the precise ASR definition, the baselines against which it is compared, or the criteria used to select or exclude concept pairs. These omissions are load-bearing for the claim that Hydra avoids entanglement and maintains fidelity under dense multi-concept injection.

minor comments (2)

Abstract / §4: 'ASR' is used without an explicit expansion on first use in the main body; add the definition for clarity.
Figure 1 (framework overview) would benefit from explicit arrows or labels indicating the flow from evolutionary search to the multi-task fine-tuning stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

Referee: §3.2 (Evolutionary Trigger Search): the description states that the search identifies triggers 'semantically aligned with their target concepts while remaining stable across other injected concepts,' yet supplies no fitness function, population size, generation count, or explicit interference penalty. Because the central claim of cross-concept stability under sequential decentralized injection rests on this optimization step, the missing details prevent assessment of whether the procedure actually enforces the required invariance.

Authors: We appreciate this observation. The original manuscript provided a high-level overview of the evolutionary trigger search to emphasize the framework's novelty. To enable full evaluation and reproducibility, we will expand Section 3.2 with the complete details of the optimization procedure. This includes the fitness function, which balances semantic alignment to the target concept (via embedding similarity) with a stability term that penalizes interference with other concepts, along with the population size, number of generations, and the explicit interference penalty term. These additions will be included in the revised manuscript.

Revision: yes

Referee: §4 (Multi-Concept Experiments): the reported ~95% ASR and 'strong clean generation' across 500 concept pairs are not accompanied by the precise ASR definition, the baselines against which it is compared, or the criteria used to select or exclude concept pairs. These omissions are load-bearing for the claim that Hydra avoids entanglement and maintains fidelity under dense multi-concept injection.

Authors: We agree that these details are essential for interpreting the results. In the revised version, we will provide a precise definition of ASR as the proportion of test prompts containing the trigger that successfully generate the target concept. We will also specify the baselines used, such as independent single-concept injections and multi-concept fine-tuning without the proposed regularization. Additionally, we will detail the concept pair selection process, which involved sampling from a diverse set of concepts while avoiding highly similar pairs to minimize natural semantic conflicts. These clarifications will be added to Section 4.

Revision: yes
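The ASR definition given in the response above (the proportion of triggered test prompts that successfully generate the target concept) reduces to a simple ratio. The sketch below is a minimal illustration of that arithmetic; how "success" is judged for a generated image is not stated on this page, so the boolean outcomes are assumed to come from some external check, and the function names `attack_success_rate` and `per_concept_asr` are hypothetical.

```python
def attack_success_rate(outcomes):
    """Fraction of triggered prompts whose output was judged to show the target.

    `outcomes` holds one boolean per triggered test prompt; the judging
    procedure itself is an assumption outside this sketch.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one triggered prompt")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def per_concept_asr(records):
    """Group (concept_pair_id, success) records and compute ASR per pair."""
    grouped = {}
    for pair_id, success in records:
        grouped.setdefault(pair_id, []).append(success)
    return {pair_id: attack_success_rate(v) for pair_id, v in grouped.items()}
```

Reporting the per-pair breakdown alongside the aggregate would show whether an average near 95 percent hides a few pairs that fail outright, which is the kind of detail the referee's comment on §4 asks for.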

Circularity Check

0 steps flagged

No circularity: empirical method with independent experimental validation

The paper proposes Hydra as a framework combining evolutionary trigger search in text-encoder space with multi-task fine-tuning and trigger-clean regularization. Claims of ~95% ASR and preserved clean fidelity across 8 attackers and 500 concept pairs are presented as outcomes of reported experiments rather than reductions of any equation to its own fitted inputs or self-citations. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing uniqueness theorems appear in the provided text; the stability assertions rest on external empirical measurements instead of internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about semantic stability in text encoder space and the ability of regularization to coordinate cross-task interactions; no explicit free parameters or invented entities are detailed in the abstract.

axioms (2)

domain assumption: Triggers can be found in text encoder space that align with target concepts yet remain stable across multiple injected concepts.
This is the core premise enabling the evolutionary search component.
domain assumption: Multi-task fine-tuning combined with trigger-clean regularization improves stability under dense injection.
Invoked to justify the training procedure for preventing entanglement.
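
One way to read the second assumption is as a weighted training objective. The expression below is a notational sketch only: the page does not give the paper's loss, so every symbol in it ($K$, $\mathcal{L}^{(k)}_{\mathrm{bd}}$, $\mathcal{L}_{\mathrm{reg}}$, $\lambda$) is an assumption introduced for illustration.

```latex
\mathcal{L}(\theta)
  = \frac{1}{K}\sum_{k=1}^{K} \mathcal{L}^{(k)}_{\mathrm{bd}}(\theta)
  + \lambda\,\mathcal{L}_{\mathrm{reg}}(\theta),
  \qquad \lambda \ge 0
```

Under this reading, $K$ would be the number of injected concept pairs, $\mathcal{L}^{(k)}_{\mathrm{bd}}$ the loss tying the $k$-th trigger to its target, and $\mathcal{L}_{\mathrm{reg}}$ a term that keeps behavior on clean prompts close to the original model. Setting $\lambda = 0$ would correspond to multi-concept fine-tuning without the regularization, which the simulated rebuttal names as a baseline.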

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness, washburn_uniqueness_aczel) reality_from_one_distinction
tag: unclear
paper passage: "Hydra performs evolutionary trigger search in the text encoder space to identify triggers that are semantically aligned with their target concepts while remaining stable across other injected concepts. It further combines multi-task fine-tuning with trigger-clean regularization"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability
tag: unclear
paper passage: "implosion manifests as instability in multi-concept backdoor learning, where competing semantic objectives pull concepts toward incompatible targets"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 4 internal anchors

[1]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695

work page 2022
[2]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,”arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” 2022. [Online]. Available: https://arxiv.org/abs/2204.06125

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Stable diffusion 1,

S. AI, “Stable diffusion 1,” https://stability.ai/news/ stable-diffusion-announcement, 2022

work page 2022
[5]

Virtually try on clothes with a new ai shopping feature,

L. Rincon, “Virtually try on clothes with a new ai shopping feature,” Google Blog, 2023

work page 2023
[6]

Invasive diffusion: How one unwilling illustrator found herself turned into an ai model,

A. Baio, “Invasive diffusion: How one unwilling illustrator found herself turned into an ai model,” 2022, http://waxy.org

work page 2022
[7]

Exposing fake images generated by text-to-image diffusion models,

Q. Xu, H. Wang, L. Meng, Z. Mi, J. Yuan, and H. Yan, “Exposing fake images generated by text-to-image diffusion models,”Pattern Recognition Letters, vol. 176, pp. 76–82, 2023

work page 2023
[8]

Explaining the sdxl latent space

Huggingface, “Explaining the sdxl latent space.” https://huggingface.co/, 2023

work page 2023
[9]

Perturbing attention gives you more bang for the buck: Subtle imaging perturbations that efficiently fool customized diffusion models,

J. Xu, Y . Lu, Y . Li, S. Lu, D. Wang, and X. Wei, “Perturbing attention gives you more bang for the buck: Subtle imaging perturbations that efficiently fool customized diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24 534–24 543

work page 2024
[10]

How to backdoor diffusion models?

S.-Y . Chou, P.-Y . Chen, and T.-Y . Ho, “How to backdoor diffusion models?” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4015–4024

work page 2023
[11]

The stronger the diffusion model, the easier the backdoor: Data poisoning to induce copyright breaches without adjusting finetuning pipeline,

H. Wang, Q. Shen, Y . Tong, Y . Zhang, and K. Kawaguchi, “The stronger the diffusion model, the easier the backdoor: Data poisoning to induce copyright breaches without adjusting finetuning pipeline,”arXiv preprint arXiv:2401.04136, 2024

work page arXiv 2024
[12]

Transtroj: Transferable backdoor attacks to pre-trained models via embedding indistinguishability,

H. Wang, T. Xiang, S. Guo, J. He, H. Liu, and T. Zhang, “Transtroj: Transferable backdoor attacks to pre-trained models via embedding indistinguishability,”arXiv preprint arXiv:2401.15883, 2024

work page arXiv 2024
[13]

arXiv preprint arXiv:2302.07944 , year=

B. Trabucco, K. Doherty, M. Gurinas, and R. Salakhutdinov, “Ef- fective data augmentation with diffusion models,”arXiv preprint arXiv:2302.07944, 2023

work page arXiv 2023
[14]

Civitai,

Civitai, “Civitai,” https://github.com/civitai/civitai, 2022

work page 2022
[15]

Huggingface,

T. A. Vass., “Huggingface,” https://huggingface.co/blog/ TimothyAlexisVass/explaining-the-sdxl-latent-space, 2022

work page 2022
[16]

Understanding implosion in text-to-image generative models,

W. Ding, C. Y . Li, S. Shan, B. Y . Zhao, and H. Zheng, “Understanding implosion in text-to-image generative models,” inProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 1211–1225

work page 2024
[17]

Villandiffusion: A unified backdoor attack framework for diffusion models,

S.-Y . Chou, P.-Y . Chen, and T.-Y . Ho, “Villandiffusion: A unified backdoor attack framework for diffusion models,”Advances in Neural Information Processing Systems, vol. 36, pp. 33 912–33 964, 2023

work page 2023
[18]

Nightshade: Prompt-specific poisoning attacks on text-to-image gener- ative models,

S. Shan, W. Ding, J. Passananti, S. Wu, H. Zheng, and B. Y . Zhao, “Nightshade: Prompt-specific poisoning attacks on text-to-image gener- ative models,” in2024 IEEE Symposium on Security and Privacy (SP). IEEE, 2024, pp. 807–825

work page 2024
[19]

Eviledit: Backdooring text-to-image diffusion models in one second,

H. Wang, S. Guo, J. He, K. Chen, S. Zhang, T. Zhang, and T. Xiang, “Eviledit: Backdooring text-to-image diffusion models in one second,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 3657–3665

work page 2024
[20]

Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis,

L. Struppek, D. Hintersdorf, and K. Kersting, “Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4584–4596

work page 2023
[21]

Bagm: A backdoor attack for manipulating text-to-image generative models,

J. Vice, N. Akhtar, R. Hartley, and A. Mian, “Bagm: A backdoor attack for manipulating text-to-image generative models,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 4865–4880, 2024

work page 2024
[22]

Poisoning language models during instruction tuning,

A. Wan, E. Wallace, S. Shen, and D. Klein, “Poisoning language models during instruction tuning,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 35 413–35 425

work page 2023
[23]

Scaling rectified flow transformers for high-resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boeselet al., “Scaling rectified flow transformers for high-resolution image synthesis,” inForty-first international conference on machine learning, 2024

work page 2024
[24]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840– 6851, 2020

work page 2020
[25]

Score-Based Generative Modeling through Stochastic Differential Equations

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differ- ential equations,”arXiv preprint arXiv:2011.13456, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011
[26]

Cogview: Mastering text-to-image generation via transformers,

M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yanget al., “Cogview: Mastering text-to-image generation via transformers,”Advances in neural information processing systems, vol. 34, pp. 19 822–19 835, 2021

work page 2021
[27]

Neural distributed image compression with cross-attention feature alignment,

N. Mital, E. ¨Ozyilkan, A. Garjani, and D. G ¨und¨uz, “Neural distributed image compression with cross-attention feature alignment,” inProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2498–2507

work page 2023
[28]

Prompt-to-Prompt Image Editing with Cross Attention Control

A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y . Pritch, and D. Cohen-Or, “Prompt-to-prompt image editing with cross attention control,”arXiv preprint arXiv:2208.01626, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

Text-image alignment for diffusion-based perception,

N. Kondapaneni, M. Marks, M. Knott, R. Guimaraes, and P. Perona, “Text-image alignment for diffusion-based perception,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2024, pp. 13 883–13 893

work page 2024
[30]

Towards understanding cross and self-attention in stable diffusion for text-guided image editing,

B. Liu, C. Wang, T. Cao, K. Jia, and J. Huang, “Towards understanding cross and self-attention in stable diffusion for text-guided image editing,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 7817–7826

work page 2024
[31]

Unleashing text-to-image diffusion models for visual perception,

W. Zhao, Y . Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5729–5739

work page 2023
[32]

Diffusion model with cross attention as an inductive bias for disentanglement,

T. Yang, C. Lan, Y . Lu, and N. Zheng, “Diffusion model with cross attention as an inductive bias for disentanglement,”Advances in Neural Information Processing Systems, vol. 37, pp. 82 465–82 492, 2024

work page 2024
[33]

Shadowcast: Stealthy data poisoning attacks against vision-language models,

Y . Xu, J. Yao, M. Shu, Y . Sun, Z. Wu, N. Yu, T. Goldstein, and F. Huang, “Shadowcast: Stealthy data poisoning attacks against vision-language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 57 733–57 764, 2024

work page 2024
[34]

{UnGANable}: Defending against{GAN-based}face manipulation,

Z. Li, N. Yu, A. Salem, M. Backes, M. Fritz, and Y . Zhang, “{UnGANable}: Defending against{GAN-based}face manipulation,” in32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 7213–7230

work page 2023
[35]

Text-to-image diffusion models can be easily backdoored through multimodal data poisoning,

S. Zhai, Y . Dong, Q. Shen, S. Pu, Y . Fang, and H. Su, “Text-to-image diffusion models can be easily backdoored through multimodal data poisoning,” inProceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1577–1587

work page 2023
[36]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

work page 2021
[37]

A survey on multi-task learning,

Y . Zhang and Q. Yang, “A survey on multi-task learning,”IEEE transactions on knowledge and data engineering, vol. 34, no. 12, pp. 5586–5609, 2021

work page 2021
[38]

Multi-task learning as multi-objective opti- mization,

O. Sener and V . Koltun, “Multi-task learning as multi-objective opti- mization,”Advances in neural information processing systems, vol. 31, 2018. 14

work page 2018
[39]

Regularized multi–task learning,

T. Evgeniou and M. Pontil, “Regularized multi–task learning,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004, pp. 109–117

work page 2004
[40]

Diffusionmtl: Learning multi-task denoising diffusion model from partially annotated data,

H. Ye and D. Xu, “Diffusionmtl: Learning multi-task denoising diffusion model from partially annotated data,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 960–27 969

work page 2024
[41]

Multi- concept customization of text-to-image diffusion,

N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y . Zhu, “Multi- concept customization of text-to-image diffusion,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 1931–1941

work page 2023
[42]

Vector quantized diffusion model for text-to-image synthesis,

S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, “Vector quantized diffusion model for text-to-image synthesis,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 696–10 706

work page 2022
[43]

Addressing negative transfer in diffusion models,

H. Go, Y . Lee, S. Lee, S. Oh, H. Moon, and S. Choi, “Addressing negative transfer in diffusion models,”Advances in Neural Information Processing Systems, vol. 36, pp. 27 199–27 222, 2023

work page 2023
[44]

Vision transformer adapters for generalizable multitask learning,

D. Bhattacharjee, S. S ¨usstrunk, and M. Salzmann, “Vision transformer adapters for generalizable multitask learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19 015–19 026

work page 2023
[45]

Graph optimal transport for cross-domain alignment,

L. Chen, Z. Gan, Y . Cheng, L. Li, L. Carin, and J. Liu, “Graph optimal transport for cross-domain alignment,” inInternational Conference on Machine Learning. PMLR, 2020, pp. 1542–1553

[46]

I. Shumailov, Z. Shumaylov, Y. Zhao, N. Papernot, R. Anderson, and Y. Gal, “AI models collapse when trained on recursively generated data,” Nature, vol. 631, no. 8022, pp. 755–759, 2024

[47]

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” Advances in neural information processing systems, vol. 33, pp. 5824–5836, 2020

[48]

C. Schuhmann, “Laion-aesthetics,” https://laion.ai/blog/laionaesthetics/, 2022

[49]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755

[50]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 248–255

[51]

J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12 888–12 900

[52]

M. Otani, R. Togashi, Y. Sawai, R. Ishigami, Y. Nakashima, S. Satoh, Z. He, and S. Hirota, “Toward verifiable and reproducible human evaluation for text-to-image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 14 277–14 286

[53]

J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, and Y. Choi, “Clipscore: A reference-free evaluation metric for image captioning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021, pp. 7514–7528

[54]

V. W. Liang, Y. Zhang, Y. Kwon, S. Yeung, and J. Y. Zou, “Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning,” Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 17 612–17 625, 2022

[55]

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017

[56]

Z. Wang, J. Zhang, S. Shan, and X. Chen, “T2ishield: Defending against backdoors on text-to-image diffusion models,” in European Conference on Computer Vision. Springer, 2024, pp. 107–124

[57]

Z. Wang, J. Zhang, S. Shan, and X. Chen, “Dynamic attention analysis for backdoor detection in text-to-image diffusion models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

Algorithm 1 EVOLUTIONARY TRIGGER SEARCH
1: Input: Assigned concept pairs C+, negative concept pairs C−, rare-word vocabulary V, population size P, maximum generat...
