Pith · machine review for the scientific record

arxiv: 2605.10180 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.CR


What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers


Pith reviewed 2026-05-12 04:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.CR
keywords diffusion transformers · risky content detection · attention heads · content suppression · text-to-image generation · inference-time safeguard · AHV · concept sensitivity

The pith

Attention heads in diffusion transformers show concept-specific sensitivity that lets risky content be detected and suppressed at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion transformers generate images by entangling text semantics and visual synthesis inside joint attention layers, which prevents the clean separation of risky concepts that earlier U-Net methods relied on. The paper shows that individual attention heads respond with distinct sensitivity patterns to particular semantic concepts, turning those patterns into a reliable signature for identifying risky tokens. This signature is captured as an Attention Head Vector that is tracked across denoising steps and used to reduce attention weights on dangerous content in a momentum-based, head-specific way. If the property holds, DiT models can be protected against sexual, violent, and copyrighted outputs without any retraining and without harming overall image quality.

Core claim

The paper establishes that attention heads in Diffusion Transformers exhibit concept-specific sensitivity. This property is formalized by representing each textual token as an Attention Head Vector (AHV) that records its sensitivity profile across all heads. During inference, a momentum-based tracker maintains token-wise AHVs over denoising steps, and a sensitivity-guided adaptive suppression step lowers the attention weights of tokens whose AHVs match risky concepts, all without model updates.
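To make the machinery concrete, here is a minimal sketch of that pipeline, assuming attention maps of shape (heads, queries, keys) and the aggregation details given in the simulated rebuttal below (mean over query positions, L2 normalization, EMA decay 0.9). The function names and the exponential damping rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def compute_ahv(attn, token_idx):
    """Per-token Attention Head Vector: how strongly each head attends to
    the token, averaged over query positions and L2-normalized.
    attn: (heads, queries, keys); token_idx indexes a text token on the
    key axis. Shapes and aggregation are assumptions."""
    v = attn[:, :, token_idx].mean(axis=1)          # (heads,)
    return v / (np.linalg.norm(v) + 1e-8)

def momentum_update(ahv_prev, ahv_now, decay=0.9):
    """Exponential moving average of AHVs across denoising steps."""
    v = decay * ahv_prev + (1.0 - decay) * ahv_now
    return v / (np.linalg.norm(v) + 1e-8)

def head_risk_scores(ahv, concept_vec):
    """Head-wise risk: elementwise contribution to the dot-product score."""
    return ahv * concept_vec

def suppress(attn, token_idx, head_scores, strength=5.0, thresh=0.0):
    """Head-specific suppression: damp the token's attention weights only
    in heads whose risk score clears the threshold. The exponential
    damping is an illustrative choice; the paper's adaptive rule may differ."""
    out = attn.copy()
    for h, s in enumerate(head_scores):
        if s > thresh:
            out[h, :, token_idx] *= np.exp(-strength * s)
    return out
```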

What carries the argument

The Attention Head Vector (AHV), a per-token sensitivity profile across attention heads that serves as a discriminative signature for detecting and suppressing risky generation tendencies.

If this is right

  • Sexual, violent, and copyright-protected content can be suppressed inside state-of-the-art DiT models at inference time without retraining.
  • The same head-sensitivity mechanism transfers across different DiT-based text-to-image architectures.
  • Robustness to adversarial prompts is achieved by dynamic momentum tracking of AHVs rather than static thresholds.
  • Image quality remains intact because suppression is applied only to identified risky attention weights in a head-specific manner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sensitivity-tracking idea could be tested on other transformer-based generative models such as video or audio diffusion transformers.
  • If AHV signatures prove stable, they might support broader concept-level editing operations beyond safety filtering.
  • Model developers could embed AHV monitoring as a lightweight default safeguard in future DiT releases.

Load-bearing premise

The sensitivity patterns that mark risky concepts stay stable enough across prompts and models to allow reliable detection and suppression without creating false positives or degrading image quality.

What would settle it

An experiment that finds a set of adversarial prompts whose risky tokens produce AHVs indistinguishable from those of safe tokens, or that shows clear drops in visual fidelity when suppression is applied to safe prompts, would falsify the claim.
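The first half of that test can be phrased as a separability measurement: collect AHVs for risky tokens under adversarial prompts and for tokens in matched safe prompts, then compute the AUROC of their risk scores. A hypothetical sketch (function name and array layout assumed); an AUROC near 0.5 would be the falsifying outcome.

```python
import numpy as np

def ahv_auroc(risky_ahvs, safe_ahvs, concept_vec):
    """AUROC of dot-product risk scores, risky vs. safe token AHVs.
    Inputs: (n, heads) arrays of unit-norm AHVs and a unit-norm concept
    vector. Rank-based AUROC (Mann-Whitney U); ties ignored for brevity."""
    scores = np.concatenate([risky_ahvs @ concept_vec, safe_ahvs @ concept_vec])
    labels = np.concatenate([np.ones(len(risky_ahvs)), np.zeros(len(safe_ahvs))])
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```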

Figures

Figure captions from arXiv:2605.10180 (An-An Liu, Chenyu Zhang, Lanjun Wang, Ruidong Chen, Wenhui Li, Yueyang Cheng); images available at the source.

Figure 1. Adversarial prompts implicitly convey the risky …
Figure 2. Traditional U-Net-based concept erasure methods aim to fine-tune the cross-attention (a) or self-attention (b) layers to …
Figure 3. Visualization of attention head sensitivity in …
Figure 4. Cumulative sensitivity curve across all attention …
Figure 5. Qualitative study demonstrating Differential Head …
Figure 6. (a) t-SNE visualization of Attention Head Vectors …
Figure 8. Visualization of sexual content suppression.
Figure 9. Token-wise risk scores across denoising steps, …
Figure 10. Visualization of copyrighted style suppression.
Figure 11. Visualization of harmful content suppression.
Figure 12. Visualization of benign content generation.
Figure 13. Visualization of sexual content suppression in …
Figure 14. Hyper-Parameter Analysis on the suppression strength …
Original abstract

The rise of text-to-image (T2I) models has increasingly raised concerns regarding the generation of risky content, such as sexual, violent, and copyright-protected images, highlighting the need for effective safeguards within the models themselves. Although existing methods have been proposed to eliminate risky concepts from T2I models, they are primarily developed for earlier U-Net architectures, leaving the state-of-the-art Diffusion-Transformer-based T2I models inadequately protected. This gap stems from a fundamental architectural shift: Diffusion Transformers (DiTs) entangle semantic injection and visual synthesis via joint attention, which makes it difficult to isolate and erase risky content within the generation. To bridge this gap, we investigate how semantic concepts are represented in DiTs and discover that attention heads exhibit concept-specific sensitivity. This property enables both the detection and suppression of risky content. Building on this discovery, we propose AHV-D&S, a training-free inference-time safeguard for image generation in DiTs. Specifically, AHV-D&S quantifies each textual token's sensitivity across all attention heads as an Attention Head Vector (AHV), which serves as a discriminative signature for detecting risky generation tendencies. In the inference stage, we propose a momentum-based strategy to dynamically track token-wise AHVs across denoising steps, and a sensitivity-guided adaptive suppression strategy that suppresses the attention weights of identified risky tokens based on head-specific risk scores. Extensive experiments demonstrate that AHV-D&S effectively suppresses sexual, copyrighted-style, and various harmful content while preserving visual quality, and further exhibits strong robustness against adversarial prompts and transferability across different DiT-based T2I models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that attention heads in Diffusion Transformers exhibit concept-specific sensitivity, enabling the definition of Attention Head Vectors (AHVs) as token-wise signatures for risky content detection. Building on this, it introduces the training-free AHV-D&S method, which uses momentum-based tracking of AHVs across denoising steps and sensitivity-guided adaptive suppression of attention weights for risky tokens. Extensive experiments are reported to show effective suppression of sexual, violent, and copyrighted content while preserving image quality, plus robustness to adversarial prompts and transferability across DiT models.

Significance. If the core empirical discovery and method hold under broader conditions, this would represent a meaningful advance in inference-time safety for state-of-the-art DiT-based text-to-image models, addressing the architectural gap versus prior U-Net-focused approaches and offering a practical, training-free intervention with potential for broader use in controlling generative outputs.

major comments (2)
  1. [Abstract / Experiments] The central claim depends on AHV signatures being sufficiently invariant to prompt phrasing and denoising timestep. The abstract and described experiments use fixed prompt sets without reported tests for early timesteps (near-noise inputs), where semantic sensitivity may be weak or shifted; this directly risks unreliable detection or unintended suppression and must be addressed with targeted ablation or cross-timestep analysis.
  2. [Method description] The momentum-based tracking and head-specific risk scores are load-bearing for the suppression strategy, yet the abstract provides no quantitative details on how AHV is computed (e.g., exact aggregation across heads or normalization), making it impossible to verify whether the approach is truly parameter-free or how it avoids degrading non-risky content.
minor comments (2)
  1. [Abstract] The acronym AHV-D&S is used before its expansion; a brief parenthetical definition on first use would improve readability.
  2. [Figures/Tables] Figure or table captions should explicitly state the metrics used for 'preserving visual quality' (e.g., FID, CLIP score) to allow direct comparison with baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract / Experiments] The central claim depends on AHV signatures being sufficiently invariant to prompt phrasing and denoising timestep. The abstract and described experiments use fixed prompt sets without reported tests for early timesteps (near-noise inputs), where semantic sensitivity may be weak or shifted; this directly risks unreliable detection or unintended suppression and must be addressed with targeted ablation or cross-timestep analysis.

    Authors: We appreciate this observation on the importance of timestep invariance for reliable detection. The manuscript reports robustness experiments across diverse prompts and applies the full AHV-D&S pipeline throughout the denoising trajectory, with momentum tracking intended to stabilize signatures. However, we did not include a dedicated cross-timestep ablation isolating early timesteps. We will add a new analysis subsection with AHV cosine-similarity matrices across timesteps and an ablation measuring detection/suppression accuracy when intervening only at early, middle, or late stages. revision: yes
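The promised cross-timestep analysis reduces to a small computation once per-step AHVs are logged. A sketch, assuming one token's AHVs are stacked as rows of a (T, heads) array (the function name is hypothetical): low similarity between the early rows and the rest would flag exactly the near-noise instability the referee raises.

```python
import numpy as np

def timestep_similarity(ahvs_by_step):
    """Pairwise cosine similarity of one token's AHVs across T denoising
    steps. ahvs_by_step: (T, heads). Returns a (T, T) similarity matrix."""
    x = ahvs_by_step / (np.linalg.norm(ahvs_by_step, axis=1, keepdims=True) + 1e-8)
    return x @ x.T
```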

  2. Referee: [Method description] The momentum-based tracking and head-specific risk scores are load-bearing for the suppression strategy, yet the abstract provides no quantitative details on how AHV is computed (e.g., exact aggregation across heads or normalization), making it impossible to verify whether the approach is truly parameter-free or how it avoids degrading non-risky content.

    Authors: We agree the abstract is too concise on these mechanics. Section 3.2 defines AHV as the per-token vector of attention weights across heads, aggregated by mean and L2-normalized; the risk score is the dot product against a fixed concept vector derived from a small set of reference prompts; suppression is applied adaptively per head using a sensitivity-derived threshold. Momentum is a simple exponential moving average (decay 0.9) with no learned parameters. We have expanded the abstract with a single sentence summarizing AHV construction and the adaptive suppression rule to improve verifiability while preserving the training-free claim. revision: partial
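Taking the rebuttal's paraphrase of Section 3.2 at face value, the fixed concept vector it describes would be derived roughly as follows; this is a sketch under those stated assumptions, and the paper's exact construction may differ.

```python
import numpy as np

def concept_vector(reference_ahvs):
    """Fixed concept vector: mean of risky tokens' AHVs from a small set
    of reference prompts, L2-normalized. reference_ahvs: (n, heads).
    Assumes the aggregation stated in the rebuttal, not verified against
    the paper itself."""
    v = np.asarray(reference_ahvs, dtype=float).mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)
```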

Circularity Check

0 steps flagged

No circularity: empirical discovery of attention-head sensitivity is independent of AHV-D&S construction

Full rationale

The paper's central claim is an empirical observation that attention heads in DiTs exhibit concept-specific sensitivity, discovered by investigating semantic representations. This observation directly motivates the definition of the AHV as a quantification of token-wise sensitivity across heads, followed by momentum tracking and adaptive suppression at inference time. No equations, fitted parameters, or self-citations reduce the discovery or the method to its own inputs; the approach is training-free and validated through experiments on suppression effectiveness. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical observation that attention heads show concept-specific sensitivity and on the effectiveness of the AHV-based suppression strategy.

axioms (1)
  • domain assumption Attention heads in Diffusion Transformers exhibit concept-specific sensitivity to semantic tokens
    Presented as the key discovery that enables detection and suppression.
invented entities (1)
  • Attention Head Vector (AHV) no independent evidence
    purpose: Quantify each textual token's sensitivity across attention heads as a signature for risky content
    New construct introduced to operationalize the sensitivity observation.

pith-pipeline@v0.9.0 · 5612 in / 1193 out tokens · 38632 ms · 2026-05-12T04:32:10.137925+00:00 · methodology


