Recognition: 2 theorem links
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
Pith reviewed 2026-05-12 04:32 UTC · model grok-4.3
The pith
Attention heads in diffusion transformers show concept-specific sensitivity that lets risky content be detected and suppressed at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that attention heads in Diffusion Transformers exhibit concept-specific sensitivity. This property is formalized by representing each textual token as an Attention Head Vector (AHV) that records its sensitivity profile across all heads. During inference, a momentum-based tracker maintains token-wise AHVs across denoising steps, and a sensitivity-guided adaptive suppression step lowers the attention weights of tokens whose AHVs match risky concepts, all without model updates.
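The suppression step described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the names `risky_tokens`, `head_risk`, and the scaling rule with strength `alpha` are assumptions for the sketch.

```python
import numpy as np

def suppress_risky_attention(attn, risky_tokens, head_risk, alpha=0.9):
    """Hypothetical head-specific suppression of attention paid to risky
    text tokens.  attn: (heads, queries, keys) post-softmax weights;
    head_risk: per-head risk scores in [0, 1]; alpha: suppression strength."""
    out = attn.copy()
    for t in risky_tokens:
        # scale each head's attention to token t by that head's risk score
        out[:, :, t] *= (1.0 - alpha * head_risk)[:, None]
    # renormalize so each query's attention still sums to 1
    return out / out.sum(axis=-1, keepdims=True)
```

Because the scaling is per head, heads with zero risk score leave their attention untouched, which is how the method can plausibly preserve quality for non-risky content.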
What carries the argument
The Attention Head Vector (AHV), a per-token sensitivity profile across attention heads that serves as a discriminative signature for detecting and suppressing risky generation tendencies.
If this is right
- Sexual, violent, and copyright-protected content can be suppressed inside state-of-the-art DiT models at inference time without retraining.
- The same head-sensitivity mechanism transfers across different DiT-based text-to-image architectures.
- Robustness to adversarial prompts is achieved by dynamic momentum tracking of AHVs rather than static thresholds.
- Image quality remains intact because suppression is applied only to identified risky attention weights in a head-specific manner.
Where Pith is reading between the lines
- The same sensitivity-tracking idea could be tested on other transformer-based generative models such as video or audio diffusion transformers.
- If AHV signatures prove stable, they might support broader concept-level editing operations beyond safety filtering.
- Model developers could embed AHV monitoring as a lightweight default safeguard in future DiT releases.
Load-bearing premise
The sensitivity patterns that mark risky concepts stay stable enough across prompts and models to allow reliable detection and suppression without creating false positives or degrading image quality.
What would settle it
An experiment that finds a set of adversarial prompts whose risky tokens produce AHVs indistinguishable from safe tokens, or that shows clear drops in visual fidelity when suppression is applied to safe prompts, would falsify the claim.
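One way to operationalize that falsification test is a nearest-centroid check on AHVs. This is a hypothetical sketch, not the paper's evaluation protocol: an accuracy near 0.5 would indicate that risky and safe AHVs are indistinguishable.

```python
import numpy as np

def centroid_separability(safe_ahvs, risky_ahvs):
    """Fraction of AHVs closer (by cosine) to their own class centroid
    than to the other class's.  Inputs: arrays of shape (n, heads) with
    rows already L2-normalized.  A result near 0.5 means the two sets
    are indistinguishable, which would falsify the detection claim."""
    cs = safe_ahvs.mean(axis=0); cs /= np.linalg.norm(cs)
    cr = risky_ahvs.mean(axis=0); cr /= np.linalg.norm(cr)
    correct = 0
    for vecs, own, other in ((safe_ahvs, cs, cr), (risky_ahvs, cr, cs)):
        correct += int(np.sum(vecs @ own > vecs @ other))
    return correct / (len(safe_ahvs) + len(risky_ahvs))
```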
Original abstract
The rise of text-to-image (T2I) models has increasingly raised concerns regarding the generation of risky content, such as sexual, violent, and copyright-protected images, highlighting the need for effective safeguards within the models themselves. Although existing methods have been proposed to eliminate risky concepts from T2I models, they are primarily developed for earlier U-Net architectures, leaving the state-of-the-art Diffusion-Transformer-based T2I models inadequately protected. This gap stems from a fundamental architectural shift: Diffusion Transformers (DiTs) entangle semantic injection and visual synthesis via joint attention, which makes it difficult to isolate and erase risky content within the generation. To bridge this gap, we investigate how semantic concepts are represented in DiTs and discover that attention heads exhibit concept-specific sensitivity. This property enables both the detection and suppression of risky content. Building on this discovery, we propose AHV-D&S, a training-free inference-time safeguard for image generation in DiTs. Specifically, AHV-D&S quantifies each textual token's sensitivity across all attention heads as an Attention Head Vector (AHV), which serves as a discriminative signature for detecting risky generation tendencies. In the inference stage, we propose a momentum-based strategy to dynamically track token-wise AHVs across denoising steps, and a sensitivity-guided adaptive suppression strategy that suppresses the attention weights of identified risky tokens based on head-specific risk scores. Extensive experiments demonstrate that AHV-D&S effectively suppresses sexual, copyrighted-style, and various harmful content while preserving visual quality, and further exhibits strong robustness against adversarial prompts and transferability across different DiT-based T2I models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that attention heads in Diffusion Transformers exhibit concept-specific sensitivity, enabling the definition of Attention Head Vectors (AHVs) as token-wise signatures for risky content detection. Building on this, it introduces the training-free AHV-D&S method, which uses momentum-based tracking of AHVs across denoising steps and sensitivity-guided adaptive suppression of attention weights for risky tokens. Extensive experiments are reported to show effective suppression of sexual, violent, and copyrighted content while preserving image quality, plus robustness to adversarial prompts and transferability across DiT models.
Significance. If the core empirical discovery and method hold under broader conditions, this would represent a meaningful advance in inference-time safety for state-of-the-art DiT-based text-to-image models, addressing the architectural gap versus prior U-Net-focused approaches and offering a practical, training-free intervention with potential for broader use in controlling generative outputs.
major comments (2)
- [Abstract / Experiments] The central claim depends on AHV signatures being sufficiently invariant to prompt phrasing and denoising timestep. The abstract and described experiments use fixed prompt sets without reported tests for early timesteps (near-noise inputs), where semantic sensitivity may be weak or shifted; this directly risks unreliable detection or unintended suppression and must be addressed with targeted ablation or cross-timestep analysis.
- [Method description] The momentum-based tracking and head-specific risk scores are load-bearing for the suppression strategy, yet the abstract provides no quantitative details on how AHV is computed (e.g., exact aggregation across heads or normalization), making it impossible to verify whether the approach is truly parameter-free or how it avoids degrading non-risky content.
minor comments (2)
- [Abstract] The acronym AHV-D&S is used before its expansion; a brief parenthetical definition on first use would improve readability.
- [Figures/Tables] Figure or table captions should explicitly state the metrics used for 'preserving visual quality' (e.g., FID, CLIP score) to allow direct comparison with baselines.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made.
Point-by-point responses
- Referee: [Abstract / Experiments] The central claim depends on AHV signatures being sufficiently invariant to prompt phrasing and denoising timestep. The abstract and described experiments use fixed prompt sets without reported tests for early timesteps (near-noise inputs), where semantic sensitivity may be weak or shifted; this directly risks unreliable detection or unintended suppression and must be addressed with targeted ablation or cross-timestep analysis.
Authors: We appreciate this observation on the importance of timestep invariance for reliable detection. The manuscript reports robustness experiments across diverse prompts and applies the full AHV-D&S pipeline throughout the denoising trajectory, with momentum tracking intended to stabilize signatures. However, we did not include a dedicated cross-timestep ablation isolating early timesteps. We will add a new analysis subsection with AHV cosine-similarity matrices across timesteps and an ablation measuring detection/suppression accuracy when intervening only at early, middle, or late stages. revision: yes
- Referee: [Method description] The momentum-based tracking and head-specific risk scores are load-bearing for the suppression strategy, yet the abstract provides no quantitative details on how AHV is computed (e.g., exact aggregation across heads or normalization), making it impossible to verify whether the approach is truly parameter-free or how it avoids degrading non-risky content.
Authors: We agree the abstract is too concise on these mechanics. Section 3.2 defines AHV as the per-token vector of attention weights across heads, aggregated by mean and L2-normalized; the risk score is the dot product against a fixed concept vector derived from a small set of reference prompts; suppression is applied adaptively per head using a sensitivity-derived threshold. Momentum is a simple exponential moving average (decay 0.9) with no learned parameters. We have expanded the abstract with a single sentence summarizing AHV construction and the adaptive suppression rule to improve verifiability while preserving the training-free claim. revision: partial
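The mechanics stated in this response (mean aggregation over heads' attention, L2 normalization, dot-product risk score, EMA with decay 0.9) can be sketched as follows. Shapes and function names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def token_ahv(attn, token):
    """Attention Head Vector for one text token: mean attention the token
    receives in each head (averaged over query positions), L2-normalized.
    attn: (heads, queries, keys) post-softmax weights."""
    v = attn[:, :, token].mean(axis=1)
    return v / (np.linalg.norm(v) + 1e-8)

def momentum_update(tracked, current, decay=0.9):
    """Exponential moving average of AHVs across denoising steps,
    re-normalized after each update.  No learned parameters."""
    v = decay * tracked + (1.0 - decay) * current
    return v / (np.linalg.norm(v) + 1e-8)

def risk_score(tracked_ahv, concept_vec):
    """Dot product against a fixed concept vector derived offline
    from a small set of reference prompts."""
    return float(tracked_ahv @ concept_vec)
```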
Circularity Check
No circularity: empirical discovery of attention-head sensitivity is independent of AHV-D&S construction
Full rationale
The paper's central claim is an empirical observation that attention heads in DiTs exhibit concept-specific sensitivity, discovered via investigation of semantic representations. This observation directly motivates the definition of AHV as a quantification of token-wise sensitivity across heads, followed by momentum tracking and adaptive suppression at inference time. No equations, fitted parameters, or self-citations reduce the discovery or the method to its own inputs by construction; the approach is presented as training-free and validated through experiments on suppression effectiveness. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Attention heads in Diffusion Transformers exhibit concept-specific sensitivity to semantic tokens.
invented entities (1)
- Attention Head Vector (AHV): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel), tagged unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Linked passage: "attention heads exhibit concept-specific sensitivity... quantified... as an Attention Head Vector (AHV)... momentum-based strategy to dynamically track token-wise AHVs across denoising steps, and a sensitivity-guided adaptive suppression strategy"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction), tagged unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Linked passage: "t-SNE visualization of AHVs... distinct cluster structures... 98.73% classification accuracy"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.