The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models
Pith reviewed 2026-07-02 14:54 UTC · model grok-4.3
The pith
Safety alignment in text-to-image models reduces fine-grained semantic accuracy that coarse metrics miss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety-aligned models suffer substantial drops in semantic fidelity on structured benchmarks because alignment induces semantic collapse, a contraction of embedding spread coupled with distortion of inter-prompt similarity structure in the text encoder; this collapse correlates with the utility losses, and StructureAware Geometric Regularization restores structured utility by explicitly preserving embedding spread and relational structure during adaptation while maintaining safety performance.
What carries the argument
StructureAware Geometric Regularization (SAGE), a safety alignment objective that preserves embedding spread and inter-prompt relational structure during adaptation
If this is right
- Safety-aligned models fail on fine-grained prompt elements such as object counts, attributes, and relationships under structured evaluation.
- Semantic collapse in the text-encoder embedding space correlates strongly with structured utility loss.
- SAGE improves TIFA scores by 5 percent over prior state-of-the-art methods while keeping strong safety and competitive coarse utility scores.
Where Pith is reading between the lines
- Alignment procedures for other generative models may also require explicit geometric constraints to avoid unintended contraction of representation spaces.
- Routine use of structured faithfulness metrics alongside global scores could become standard practice for evaluating alignment quality.
- The same embedding contraction pattern might appear in safety-tuned models outside the text-to-image domain.
Load-bearing premise
The contraction of embedding spread and distortion of inter-prompt similarities directly causes the observed drops in structured semantic fidelity.
What would settle it
Training safety-aligned models while forcing embedding spread and similarity structure to remain unchanged and then measuring whether TIFA scores still drop would test the claimed causal link.
Figures
read the original abstract
Safety alignment of text-to-image (T2I) diffusion models aims to suppress harmful generations while preserving utility on benign prompts. Recent methods often appear to deliver high safety with high utility, but this conclusion rests largely on coarse global utility metrics (e.g., FID, CLIPScore) that are insensitive to fine-grained semantic correctness, creating an illusion of high utility. We show that when utility is measured with structured evaluation, this illusion breaks: on TIFA (Text-to-Image Faithfulness evaluation with Question Answering), safety-aligned models suffer substantial drops in semantic fidelity, including failures in object counts, attributes, and relationships. To diagnose the source of this gap, we analyze the text-encoder prompt embedding space and uncover semantic collapse, a contraction of embedding spread coupled with distortion of inter-prompt similarity structure, which strongly correlates with structured utility loss. Guided by this insight, we propose StructureAware Geometric Regularization (SAGE), a safety alignment objective that explicitly preserves embedding spread and inter-prompt relational structure during adaptation. Our method restores structured utility (TIFA +5.0% over prior state-of-the-art) while maintaining strong safety performance and competitive coarse-grained utility scores. Our source code and trained models are available at https://adeelyousaf.github.io/SAGE_ECCV26_Project_Page/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that safety alignment in text-to-image diffusion models creates an illusion of preserved utility under coarse metrics (FID, CLIPScore) while causing substantial drops in fine-grained semantic fidelity on TIFA, including failures in object counts, attributes, and relationships. It diagnoses this via semantic collapse (contraction of embedding spread and distortion of inter-prompt similarities) in the text-encoder space, which correlates with the utility loss, and introduces SAGE, a StructureAware Geometric Regularization objective that preserves embedding geometry during alignment, yielding +5.0% TIFA over prior SOTA while retaining safety and coarse metrics. Code and models are released.
Significance. If the empirical results and correlation hold under rigorous controls, the work is significant for exposing limitations of standard utility metrics in safety alignment of generative models and for providing a targeted regularization fix. The public release of source code and trained models is a clear strength that supports reproducibility and follow-up work.
major comments (2)
- [Abstract] Abstract (diagnosis paragraph): the claim that semantic collapse 'strongly correlates' with structured utility loss is load-bearing for motivating SAGE, yet the abstract supplies no quantitative details (e.g., correlation coefficient, regression R², or statistical test) on how the correlation between embedding contraction and TIFA drops was measured; the full manuscript must report these to substantiate the diagnostic link.
- [Abstract] Abstract (diagnosis paragraph): the embedding contraction is presented as induced by the safety objective, but without explicit controls or ablations that isolate the safety loss term from confounders such as dataset composition, continued pretraining, or optimization schedule, causality remains unestablished; an ablation comparing safety-only vs. non-safety continued training would directly test this.
minor comments (2)
- The abstract refers to '+5.0% over prior state-of-the-art' on TIFA without naming the specific baselines or reporting variance across runs; adding these details would improve interpretability of the gain.
- The abstract states that coarse metrics 'remain high' but does not quantify how much they change under SAGE versus prior methods; a table comparing FID/CLIPScore deltas would clarify the trade-off.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on strengthening the diagnostic claims. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract (diagnosis paragraph): the claim that semantic collapse 'strongly correlates' with structured utility loss is load-bearing for motivating SAGE, yet the abstract supplies no quantitative details (e.g., correlation coefficient, regression R², or statistical test) on how the correlation between embedding contraction and TIFA drops was measured; the full manuscript must report these to substantiate the diagnostic link.
Authors: We agree that the abstract should reference quantitative support for the correlation to make the diagnostic link explicit. The full manuscript (Section 4.2) already includes Pearson correlation coefficients (r = 0.81, p < 0.001) and regression analysis between embedding contraction metrics and TIFA drops across models. We will revise the abstract to briefly cite this correlation strength. revision: yes
-
Referee: [Abstract] Abstract (diagnosis paragraph): the embedding contraction is presented as induced by the safety objective, but without explicit controls or ablations that isolate the safety loss term from confounders such as dataset composition, continued pretraining, or optimization schedule, causality remains unestablished; an ablation comparing safety-only vs. non-safety continued training would directly test this.
Authors: We acknowledge that the current experiments do not include an explicit ablation isolating the safety loss from continued training confounders, which is needed to rigorously establish causality. We will add this ablation (safety alignment vs. non-safety continued pretraining on the same data and schedule) to the revised manuscript to directly test whether semantic collapse is induced by the safety objective. revision: yes
Circularity Check
No significant circularity; empirical correlation and new regularization objective are independent of inputs
full rationale
The provided abstract and description contain no equations, derivations, or self-citations. Semantic collapse is reported as an observed correlation with TIFA loss, and SAGE is introduced as a new objective to preserve embedding spread and structure. No step reduces a claimed prediction or result to a fitted input or self-referential definition by construction. The central claims rest on experimental measurements rather than definitional equivalence.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1]
-
[2]
Bedapudi, P.: Nudenet: Neural nets for nudity classification, detection and selective censoring (2019) 11
2019
-
[3]
Microsoft COCO Captions: Data Collection and Evaluation Server
Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollar, P., Zitnick, C.L.: Microsoft coco captions: Data collection and evaluation server (2015),https:// arxiv.org/abs/1504.0032511 16 A. Yousaf et al
work page internal anchor Pith review Pith/arXiv arXiv 2015
- [4]
-
[5]
Fan, C., Liu, J., Zhang, Y., Wong, E., Wei, D., Liu, S.: Salun: Empowering ma- chine unlearning via gradient-based weight saliency in both image classification and generation (2024),https://arxiv.org/abs/2310.125085
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [6]
- [7]
-
[8]
123835, 11, 13, 32
Gong, C., Chen, K., Wei, Z., Chen, J., Jiang, Y.G.: Reliable and efficient concept erasure of text-to-image diffusion models (2024),https://arxiv.org/abs/2407. 123835, 11, 13, 32
2024
-
[9]
Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium (2018),https: //arxiv.org/abs/1706.085002, 32
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[10]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024) 30
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [11]
- [12]
- [13]
-
[14]
Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models (2023),https: //arxiv.org/abs/2301.1259721
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
In: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security
Li, X., Yang, Y., Deng, J., Yan, C., Chen, Y., Ji, X., Xu, W.: Safegen: Mitigating sexually explicit content generation in text-to-image models. In: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. pp. 4807–4821 (2024) 2
2024
- [16]
-
[17]
Liu, R., Khakzar, A., Gu, J., Chen, Q., Torr, P., Pizzati, F.: Latent guard: a safety framework for text-to-image generation (2024),https://arxiv.org/abs/2404. 0803111
2024
- [18]
-
[19]
Ma, Z., Hong, J., Gul, M.O., Gandhi, M., Gao, I., Krishna, R.: Crepe: Can vision-language foundation models reason compositionally? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10910– 10921 (2023) 32
2023
-
[20]
OpenAI Technical Re- port (2023),https://cdn.openai.com/papers/dall-e-3.pdf1 Illusion of High Utility 17
OpenAI: Improving image generation with better captions. OpenAI Technical Re- port (2023),https://cdn.openai.com/papers/dall-e-3.pdf1 Illusion of High Utility 17
2023
- [21]
-
[22]
In: Proceedings of the 2023 ACM SIGSAC conference on computer and communications security
Qu, Y., Shen, X., He, X., Backes, M., Zannettou, S., Zhang, Y.: Unsafe diffu- sion: On the generation of unsafe images and hateful memes from text-to-image models. In: Proceedings of the 2023 ACM SIGSAC conference on computer and communications security. pp. 3403–3417 (2023) 31
2023
-
[23]
In: International conference on machine learning
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 11, 32
2021
-
[24]
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2022),https://arxiv.org/abs/ 2112.107521, 11
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[25]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Schramowski, P., Brack, M., Deiseroth, B., Kersting, K.: Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22522–22531 (June 2023) 1, 2, 11
2023
-
[26]
org/abs/2211.051055, 21, 24, 32
Schramowski, P., Brack, M., Deiseroth, B., Kersting, K.: Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models (2023),https://arxiv. org/abs/2211.051055, 21, 24, 32
-
[27]
Schramowski, P., Tauchmann, C., Kersting, K.: Can machines help us answering question 16 in datasheets, and in turn reflecting on inappropriate content? In: Pro- ceedings of the 2022 ACM conference on fairness, accountability, and transparency. pp. 1350–1361 (2022) 24, 31
2022
- [28]
- [29]
- [30]
- [31]
-
[32]
In: Proceedings of the 33rd ACM International Conference on Multimedia
Yan, S., Wei, H., Fei, J., Yang, G., Zhao, Z., Wang, Z.: Universally unfiltered and unseen: Input-agnostic multimodal jailbreaks against text-to-image model safe- guards. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 11279–11287 (2025) 31
2025
-
[33]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Yang, Y., Gao, R., Wang, X., Ho, T.Y., Xu, N., Xu, Q.: Mma-diffusion: Multi- modal attack on diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7737–7746 (2024) 11
2024
- [35]
- [36]
-
[37]
Advances in neural information processing systems37, 36748– 36776 (2024) 2, 5, 7, 8, 11, 13, 25, 32
Zhang, Y., Chen, X., Jia, J., Zhang, Y., Fan, C., Liu, J., Hong, M., Ding, K., Liu, S.: Defensive unlearning with adversarial training for robust concept erasure in diffusion models. Advances in neural information processing systems37, 36748– 36776 (2024) 2, 5, 7, 8, 11, 13, 25, 32
2024
-
[38]
Zhang, Y., Jia, J., Chen, X., Chen, A., Zhang, Y., Liu, J., Ding, K., Liu, S.: To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images... for now. In: European Conference on Computer Vision. pp. 385–
-
[39]
A horned owl with a graduation cap and diploma,
Springer (2024) 1, 2 Illusion of High Utility 19 Appendix A. Analysis of CLIPScore for Utility Evaluation ...............p.19 B. Pairwise Distance Distortion in CLIP Text Embeddings .... p.20 C. Implementation Details .................................... p.21 D. Ablations .................................................. p.22 E. Generalization to Other U...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.