
arxiv: 2605.28137 · v1 · pith:OBTBAALAnew · submitted 2026-05-27 · 💻 cs.CV · cs.LG

No Safe Dose: How Training Data Drives Unsafe Image Generation

Pith reviewed 2026-06-29 13:38 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords text-to-image models · unsafe content · training data composition · model safety · data curation · text encoder · safety classifiers

The pith

The proportion of unsafe images in training data directly raises the rate of unsafe outputs from text-to-image models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates the effect of unsafe content in training data by building otherwise identical text-to-image models on datasets that vary only in the fraction of unsafe images, from zero to 9.6 percent. Output unsafety measured by four independent classifiers rises steadily with that fraction, and the proportion of unsafe images matters more than their absolute number. Even with no unsafe images in training, a baseline of 16.6 percent unsafe outputs remains, which drops to 9.6 percent when a safer text encoder is substituted. Safety filtering of the data produces no measurable degradation in standard quality scores such as FID, CLIPscore, or ImageReward. These controlled results indicate that data curation and text-encoder safety act as separate levers.

Core claim

Training the same text-to-image architecture on datasets that differ solely in the fraction of unsafe images produces a monotonic rise in unsafe model outputs, from 16.6 percent at zero contamination to 25.5 percent at five percent contamination; the operative variable is the proportion rather than the absolute count of unsafe training images, while a residual baseline risk persists even at zero contamination and is partly traceable to the frozen text encoder.

What carries the argument

The controlled dose-response relationship between the proportion of unsafe training images and measured output unsafety, isolated via factorial dataset construction.

If this is right

  • Safety filtering of training data lowers output unsafety without harming FID, CLIPscore, or ImageReward.
  • Swapping the text encoder for a safer variant reduces the zero-contamination baseline from 16.6 percent to 9.6 percent.
  • Data curation and text-encoder safety function as complementary, independently effective interventions.
  • The proportion of unsafe images, rather than their total count, governs the safety outcome across dataset scales from 100K to 8M.
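To make the proportion-versus-count distinction concrete, here is a minimal sketch of how a factorial grid can decouple the two variables. The scales and proportions below are hypothetical placeholders chosen for illustration, not the paper's actual cells, and the helper names `build_cells` and `contrasts` are assumptions of this sketch.

```python
from itertools import combinations, product

# Hypothetical grid values, chosen only so that some cells share an unsafe
# count U while differing in proportion p. Not taken from the paper.
SCALES = [100_000, 1_000_000, 8_000_000]   # dataset size N
PROPORTIONS = [0.005, 0.05]                # unsafe fraction p

def build_cells(scales, proportions):
    """Cross every dataset size with every unsafe proportion; U = N * p."""
    return [{"N": n, "p": p, "U": round(n * p)}
            for n, p in product(scales, proportions)]

def contrasts(cells):
    """Split cell pairs into same-proportion and same-count comparisons."""
    same_p, same_u = [], []
    for a, b in combinations(cells, 2):
        if a["p"] == b["p"] and a["U"] != b["U"]:
            same_p.append((a, b))   # proportion fixed, count varies
        elif a["U"] == b["U"] and a["p"] != b["p"]:
            same_u.append((a, b))   # count fixed, proportion varies
    return same_p, same_u

if __name__ == "__main__":
    same_p, same_u = contrasts(build_cells(SCALES, PROPORTIONS))
    print(len(same_p), "pairs hold p fixed;", len(same_u), "pairs hold U fixed")
```

If output unsafety tracks proportion, pairs that hold p fixed should show similar rates despite different counts, while pairs that hold U fixed should diverge.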

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proportion-driven effect may appear in other generative modalities once comparable controlled datasets become available.
  • If model capabilities continue to grow, the residual baseline unsafety could interact with new compositional behaviors in ways the current experiments do not test.
  • Repeated safety filtering at both data and encoder stages might drive the floor still lower, but that combined regime lies outside the reported design.

Load-bearing premise

The datasets differ only in the fraction of unsafe images and the four safety classifiers give an unbiased reading of true output unsafety.

What would settle it

Generating images from models trained on increasing proportions of unsafe data and finding no corresponding rise in the fraction flagged unsafe by the classifiers would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.28137 by Felix Friedrich, Kristian Kersting, Lukas Helff, Niharika Hegde, Patrick Schramowski.

Figure 1. Models become unsafer over time. Unsafe generation rates rise across successive T2I model generations, with certain harm categories showing steeper increases.
Figure 2. Percentage of unsafe model outputs as a function of unsafe training data proportion. Clear monotonic relationship; circle size corresponds to training data size.

  ID   Name     N       p (%)   U      q (%)   ∆q
  Original/Reference
  C0   8M-1%    7.94M   1.21    96K    20.6    —
  Rate-controlled (p), fixed scale (N ≈ 8M)
  C1   8M-0%    7.94M   0.00    0      16.6    –4.0
  C2   8M-5%    8.24M   5.00    412K   25.5    +4.9
  C3   8M-10%   8.64M   9.60    829K   26.4    +5.8
  Rate-controlled (p), scale sweep …
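The rows of the Figure 2 table can be cross-checked with a few lines of arithmetic. The sketch below assumes that U is the unsafe image count implied by U = N · p and that ∆q is the difference in unsafe output rate relative to the reference condition C0; both readings are inferred from the table, and the function names are illustrative only.

```python
# Rows transcribed from the Figure 2 table:
# (condition ID, N in millions, p in percent, reported U in thousands, q in percent)
ROWS = [
    ("C0", 7.94, 1.21, 96, 20.6),
    ("C1", 7.94, 0.00, 0, 16.6),
    ("C2", 8.24, 5.00, 412, 25.5),
    ("C3", 8.64, 9.60, 829, 26.4),
]

def unsafe_count_thousands(n_millions, p_percent):
    """Unsafe image count in thousands implied by U = N * p (assumed relation)."""
    return n_millions * 1000 * p_percent / 100

def delta_q(rows, reference="C0"):
    """Unsafe output rate minus the reference row's rate, in percentage points."""
    q_ref = next(row[4] for row in rows if row[0] == reference)
    return {row[0]: round(row[4] - q_ref, 1) for row in rows}

def is_monotonic_in_p(rows):
    """True if q strictly increases when rows are ordered by p."""
    qs = [row[4] for row in sorted(rows, key=lambda r: r[2])]
    return all(a < b for a, b in zip(qs, qs[1:]))

if __name__ == "__main__":
    for cid, n, p, u, q in ROWS:
        print(cid, "implied U (K):", round(unsafe_count_thousands(n, p)), "reported:", u)
    print(delta_q(ROWS))
    print("monotonic in p:", is_monotonic_in_p(ROWS))
```

Under these assumptions the implied counts match the reported U column after rounding, and the four fixed-scale rows are strictly increasing in p.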
Figure 3. Cross-classifier unsafe rates (%) across seven training contamination conditions. Despite approx. 2× differences in absolute rates (due to different policies, coverage, strictness), all classifiers trace the same per-condition profile, illustrating the effect is independent of the specific classifier.
Figure 4. Category composition of unsafe outputs. Columns show fraction of unsafe outputs per safety category (O1–O9). O3 and O4 show the strongest sensitivity to training contamination.

… higher rates to C0 and C2. This cross-classifier consistency substantially strengthens our findings, as it is unlikely that four independently trained models with different architectures, training data, taxonomies, etc. would all ex…
Figure 5. Model scale ablation. Unsafe output rate (%) for 1.2B and 3.6B models on C1 (0%) and C0 (1.2%) training data conditions.

  Params   C1 (0%)   C0 (1.2%)
  1.2B     16.6      20.6
  3.6B     16.3      19.7

Impact of model scale. While scaling laws for diffusion transformers [33] suggest that performance trends at smaller scales generalize to larger models, we explicitly ablate the influence of model capacity on safety behavior. To thi…
Figure 6. Training loss convergence. (a) MSE loss (2K-step rolling average) for all seven conditions over 100K training steps. All conditions converge rapidly and plateau after ∼50K steps. (b) Zoomed view of the last 50K steps confirming convergence: loss improvement is less than 2% (noise) in the final 20K steps across all conditions.
Figure 7. Cross-classifier agreement on training data annotations. Agreement rates and Cohen’s κ between LlavaGuard (primary annotator) and three alternative safety classifiers on a shared subset of training images.
Original abstract

Text-to-image models trained on large-scale data often inevitably ingest unsafe content. While some people observe input-output amplifications, it remains unclear whether and how training data composition directly drives model output safety or by other factors. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ only in their fraction of unsafe images (0% to 9.6%), across several dataset scales (100K to 8M). Then we generate images with the resulting models, and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute count, of unsafe training images is the operative variable. The 16.6% irreducible baseline at zero contamination implicates the other components, e.g. frozen text encoder, as a residual safety risk, confirmed by a text encoder ablation showing that SafeCLIP reduces this floor to 9.6%, while the dose-response effect persists across all three encoders tested. Critically, no quality degradation in terms of FID, CLIPscore and ImageReward accompanies safety filtering. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety poses questions for future research about emerging capabilities and compositionality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that training text-to-image models on datasets differing only in the fraction of unsafe images (0% to 9.6%) at scales from 100K to 8M produces a monotonic rise in output unsafety (measured by four classifiers) from 16.6% at zero contamination to 25.5% at 5%. A factorial design isolates proportion (not absolute count) as the operative variable. A 16.6% baseline at zero contamination is attributed to other components such as the frozen text encoder; an ablation with SafeCLIP lowers this floor to 9.6% while the dose-response persists. Safety filtering incurs no measurable degradation in FID, CLIPscore, or ImageReward, implying data curation and text-encoder safety are complementary interventions.

Significance. If the empirical results hold after verification of dataset construction and classifier validity, the work supplies a controlled demonstration that unsafe training proportion directly drives output unsafety, with the factorial design and text-encoder ablation providing evidence that proportion is causal and that residual risks arise from other model components. The absence of quality trade-offs strengthens the practical implication that curation is an effective, low-cost intervention complementary to encoder-level fixes.

major comments (3)
  1. [Methods] Methods (dataset construction and factorial design): The claim that datasets 'differ only in their fraction of unsafe images' is load-bearing for the monotonic dose-response and 'proportion, not count' conclusion, yet the manuscript supplies no description of the curation procedure, caption generation, visual distribution matching, or controls for other statistics. Without these details, alternative explanations (e.g., correlated changes in caption style or image quality) cannot be excluded.
  2. [Results] Results (safety classifier evaluation): The reported rates (16.6% to 25.5%) and monotonic relationship rest on four independent classifiers, but no inter-classifier agreement statistics, calibration curves, or human validation against perceived unsafety are provided. This measurement gap directly affects the reliability of the baseline, the dose-response, and the text-encoder ablation results.
  3. [Ablations] Ablations (text encoder): The SafeCLIP ablation is cited to confirm the 16.6% floor arises from the text encoder, yet the manuscript does not report the precise experimental factors in the factorial design, sample sizes per cell, or any statistical test for the proportion-vs-count contrast. These omissions prevent assessment of whether the design isolates the claimed variable.
minor comments (1)
  1. [Abstract] The abstract states results across 'several dataset scales (100K to 8M)' but does not include a table or figure breaking down unsafety rates by scale; adding this would clarify whether the proportion effect is scale-invariant.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important gaps in methodological transparency and validation that we will address in revision. Below we respond point-by-point to the major comments.

  1. Referee: [Methods] Methods (dataset construction and factorial design): The claim that datasets 'differ only in their fraction of unsafe images' is load-bearing for the monotonic dose-response and 'proportion, not count' conclusion, yet the manuscript supplies no description of the curation procedure, caption generation, visual distribution matching, or controls for other statistics. Without these details, alternative explanations (e.g., correlated changes in caption style or image quality) cannot be excluded.

    Authors: We agree that the current manuscript does not provide sufficient detail on dataset construction to fully substantiate the claim that the datasets differ solely in unsafe-image fraction. Although the source datasets and filtering criteria are referenced, explicit descriptions of the curation pipeline, caption generation method, visual-distribution matching steps, and controls for confounding statistics (image quality, caption style, etc.) are missing. In the revised manuscript we will add a dedicated subsection in Methods that documents these procedures and the controls employed. revision: yes

  2. Referee: [Results] Results (safety classifier evaluation): The reported rates (16.6% to 25.5%) and monotonic relationship rest on four independent classifiers, but no inter-classifier agreement statistics, calibration curves, or human validation against perceived unsafety are provided. This measurement gap directly affects the reliability of the baseline, the dose-response, and the text-encoder ablation results.

    Authors: We acknowledge that the reliability of the four safety classifiers requires additional quantitative support. We will add inter-classifier agreement statistics (pairwise agreement rates and Cohen’s kappa) and any available calibration information to the revised Results section. Human validation against perceived unsafety was not performed in the original study; we will therefore note this as a limitation and reference the classifiers’ prior validation in the literature rather than claim new human-grounded evidence. revision: partial

  3. Referee: [Ablations] Ablations (text encoder): The SafeCLIP ablation is cited to confirm the 16.6% floor arises from the text encoder, yet the manuscript does not report the precise experimental factors in the factorial design, sample sizes per cell, or any statistical test for the proportion-vs-count contrast. These omissions prevent assessment of whether the design isolates the claimed variable.

    Authors: We agree that the factorial design must be described with greater precision. The revised manuscript will explicitly list the experimental factors, the number of samples per cell, and the statistical tests (including any regression or ANOVA results) used to demonstrate that proportion, rather than absolute count, drives the observed effect. revision: yes
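The agreement statistics promised in response 2 (pairwise agreement rate and Cohen's kappa) are simple to compute for binary safe/unsafe labels. The following is a minimal sketch; the function name and the example labels are hypothetical and do not come from the paper's classifiers.

```python
def agreement_and_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for two binary label lists.

    Labels are 1 for "unsafe" and 0 for "safe". Kappa corrects the observed
    agreement for the agreement expected by chance from each rater's marginals.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    rate_a = sum(labels_a) / n          # fraction flagged unsafe by rater A
    rate_b = sum(labels_b) / n          # fraction flagged unsafe by rater B
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:                 # both raters constant and identical
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)

if __name__ == "__main__":
    # Hypothetical labels from two classifiers on the same eight images.
    clf_a = [1, 1, 0, 0, 1, 0, 0, 0]
    clf_b = [1, 0, 0, 0, 1, 0, 1, 0]
    print(agreement_and_kappa(clf_a, clf_b))
```

Reporting kappa alongside raw agreement matters here because unsafe images are a minority class, so two classifiers can agree often simply by labeling most images safe.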

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential reductions


The paper reports direct experimental outcomes from training identical models on datasets that differ only in unsafe image fraction (0% to 9.6%), then measuring generated image unsafety via four classifiers. No equations, fitted parameters, or derivations appear in the supplied text. Claims such as the monotonic rise from 16.6% to 25.5% and the proportion-vs-count factorial result are presented as observed data points, not quantities defined in terms of themselves or reduced via self-citation. The text-encoder ablation is likewise an independent experimental contrast. Because the central results rest on external measurement rather than any internal definitional loop, the derivation chain (such as it is) is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions: that safety classifiers accurately quantify unsafety and that the constructed datasets isolate the unsafe fraction variable with no other differences.

axioms (2)
  • domain assumption Safety classifiers accurately measure image unsafety
    Evaluation of generated images relies on four independent safety classifiers to determine output unsafety rates.
  • domain assumption Datasets differ only in the fraction of unsafe images
    The experimental isolation of the proportion variable assumes all other dataset properties are held constant across conditions.

pith-pipeline@v0.9.1-grok · 5804 in / 1406 out tokens · 31006 ms · 2026-06-29T13:38:08.325571+00:00 · methodology


Reference graph

Works this paper leans on

70 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1] A. Birhane and V. U. Prabhu. Large image datasets: A pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536–1546. IEEE, 2021.

  2. [2] A. Birhane, V. U. Prabhu, and E. Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021.

  3. [3] A. Birhane, S. Han, V. Boddeti, S. Luccioni, et al. Into the laion’s den: Investigating hate in multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.

  4. [4] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine. Training diffusion models with reinforcement learning. In International Conference on Learning Representations (ICLR), 2024.

  5. [5] M. Brack, F. Friedrich, D. Hintersdorf, L. Struppek, P. Schramowski, and K. Kersting. SEGA: Instructing text-to-image models using semantic guidance. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

  6. [6] M. Brack, F. Friedrich, K. Kornmeier, L. Tsaban, P. Schramowski, K. Kersting, and A. Passos. LEdits++: Limitless image editing using text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  7. [7] M. Brack, S. Katakol, F. Friedrich, P. Schramowski, H. Ravi, K. Kersting, and A. Kale. How to train your text-to-image model: Evaluating design choices for synthetic training captions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, October 2025.

  8. [8] N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramer, B. Balle, D. Ippolito, and E. Wallace. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 5253–5270, 2023.

  9. [9] CompVis. Stable diffusion safety checker. https://huggingface.co/CompVis/stable-diffusion-safety-checker, 2022. CLIP-based NSFW concept classifier shipped with Stable Diffusion.

  10. [10] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorber, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning (ICML), 2024.

  11. [11] European Commission. AI Act: Regulatory Framework on Artificial Intelligence. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai, 2024. Regulation (EU) 2024/1689. Accessed: 2026-05-19.

  12. [12] F. Friedrich, W. Stammer, P. Schramowski, and K. Kersting. Revision transformers: Instructing language models to change their values. In European Conference on Artificial Intelligence (ECAI), 2023.

  13. [13] F. Friedrich, M. Brack, L. Struppek, D. Hintersdorf, P. Schramowski, S. Luccioni, and K. Kersting. Auditing and instructing text-to-image generation models on fairness. AI and Ethics, 2024.

  14. [14] F. Friedrich, S. Tedeschi, P. Schramowski, M. Brack, R. Navigli, H. Nguyen, B. Li, and K. Kersting. LLMs lost in translation: M-ALERT uncovers cross-linguistic safety inconsistencies. In ICLR Workshop on Building Trust in Language Models and Applications, 2025.

  15. [15] F. Friedrich, T. G. Welsch, M. Brack, et al. Beyond overcorrection: Evaluating diversity in T2I models with DivBench. arXiv preprint arXiv:2507.03015, 2025.

  16. [16] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.

  17. [17] R. Gandikota, J. Materzyńska, J. Fiotto-Kaufman, and D. Bau. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

  18. [18] R. Gandikota, H. Orgad, Y. Belinkov, J. Materzyńska, and D. Bau. Unified concept editing in diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024.

  19. [19] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.

  20. [20] Gemma Team, M. Riviere, S. Pathak, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024. URL https://arxiv.org/abs/2408.00118.

  21. [21] S. Ghosh, H. Frase, A. Williams, S. Luger, P. Röttger, F. Barez, S. McGregor, et al. MLCommons AILuminate: Introducing v1.0 of the AI risk and reliability benchmark. arXiv preprint arXiv:2503.05731, 2025.

  22. [22] Google. Nano banana (gemini 2.5 flash image): Multimodal image generation and editing. https://www.digitalocean.com/resources/articles/nano-banana, 2025. AI image generation and editing model within the Gemini 2.5 Flash system.

  23. [23] M. Hall, L. van der Maaten, L. Gustafson, M. Jones, and A. Adcock. A systematic study of bias amplification. arXiv preprint arXiv:2201.11706, 2022.

  24. [24] R. Härle, F. Friedrich, M. Brack, S. Wäldchen, B. Deiseroth, P. Schramowski, and K. Kersting. Measuring and guiding monosemanticity. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

  25. [25] L. Helff, F. Friedrich, M. Brack, K. Kersting, and P. Schramowski. LlavaGuard: An open VLM-based framework for safeguarding vision datasets and models. In International Conference on Machine Learning (ICML), 2025.

  26. [26] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  27. [27] D. Hintersdorf, L. Struppek, M. Brack, F. Friedrich, P. Schramowski, and K. Kersting. Does CLIP know my face? Journal of Artificial Intelligence Research (JAIR), 2024.

  28. [28] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.

  29. [29] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

  30. [30] N. Kumari, B. Zhang, S.-Y. Wang, E. Shechtman, R. Zhang, and J.-Y. Zhu. Ablating concepts in text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

  31. [31] K. Lee, H. Liu, M. Ryu, O. Watkins, Y. Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.

  32. [32] G. Li, K. Chen, S. Zhang, J. Zhang, and T. Zhang. Art: Automatic red-teaming for text-to-image models to protect benign users. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  33. [33] L. Li, C. Chen, R. Qian, W. Hu, T.-J. Fu, J. Tong, X. Wang, B. Zhang, A. Schwing, W. Liu, and Y. Yang. Dit-air: Revisiting the efficiency of diffusion model architecture design in text to image generation. arXiv preprint arXiv:2503.10618, 2025.

  34. [34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, pages 740–755. Springer, 2014.

  35. [35] S. Lu, Z. Wang, L. Li, Y. Liu, and A. W.-K. Kong. MACE: Mass concept erasure in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  36. [36] A. S. Luccioni, C. Akiki, M. Mitchell, and Y. Jernite. Stable bias: Evaluating societal representations in diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023.

  37. [37] Midjourney, Inc. Midjourney: Ai-based image generation system. https://www.midjourney.com, 2025. Text-to-image model known for stylized and high-quality visual generation.

  38. [38] M. Mundt, A. Ovalle, F. Friedrich, A. Pranav, S. Paul, et al. The cake that is intelligence and who gets to bake it: An AI analogy and its implications for participation. arXiv preprint arXiv:2502.03038, 2025.

  39. [39] T. Nakamura, M. Mishra, S. Tedeschi, Y. Chai, J. T. Stillerman, F. Friedrich, et al. Aurora-M: Open source continual pre-training for multilingual language and code. In International Conference on Computational Linguistics (COLING) Industry Track, 2025.

  40. [40] Photoroom. Prx: Text-to-image generation via rectified flow transformer. HuggingFace blog,

  41. [41] Available at https://huggingface.co/Photoroom/prx-1024-t2i-beta

  42. [42] S. Poppi, T. Poppi, F. Cocchi, M. Cornia, L. Baraldi, and R. Cucchiara. Safe-clip: Removing nsfw concepts from vision-and-language models. In European Conference on Computer Vision, pages 340–356. Springer, 2024.

  43. [43] Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y. Zhang. Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models. In Proceedings of the 2023 ACM SIGSAC conference on computer and communications security, pages 3403–3417, 2023.

  44. [44] J. Quaye, A. Parrish, O. Inel, C. Rastogi, H. R. Kirk, M. Kahng, E. Van Liemt, M. Bartolo, J. Tsang, J. White, et al. Adversarial nibbler: An open red-teaming method for identifying diverse harms in text-to-image generation. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 388–406, 2024.

  45. [45] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  46. [46] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

  47. [47] J. Rando, D. Paleka, D. Lindner, L. Heim, and F. Tramèr. Red-teaming the stable diffusion safety filter. In NeurIPS ML Safety Workshop, 2022.

  48. [48] A. Reuel, A. Ghosh, J. Chim, A. Tran, Y. Long, J. Mickel, et al. Who evaluates AI’s social impacts? mapping coverage and gaps in first and third party evaluations. In International Conference on Machine Learning (ICML), 2026.

  49. [49] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

  50. [50] P. Röttger, G. Attanasio, F. Friedrich, J. Goldzycher, et al. MSTS: A multimodal safety test suite for vision-language models. arXiv preprint arXiv:2501.10057, 2025.

  51. [51] P. Schramowski, C. Tauchmann, and K. Kersting. Can machines help us answering question 16 in datasheets, and in turn reflecting on inappropriate content? In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1350–1361, 2022.

  52. [52] P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22522–22531, 2023.

  53. [53] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.

  54. [54] P. Seshadri, S. Singh, and Y. Elazar. The bias amplification paradox in text-to-image generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2024.

  55. [55] I. Solaiman, Z. Talat, W. Agnew, L. Ahmad, D. Baker, S. L. Blodgett, C. Chen, H. Daumé III, J. Dodge, I. Duan, et al. Evaluating the social impact of generative AI systems in systems and society. In Oxford Handbook on the Foundations and Regulation of Generative AI, 2023.

  56. [56] G. Somepalli, V. Singla, M. Goldblum, J. Geiping, and T. Goldstein. Diffusion art or digital forgery? investigating data replication in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  57. [57] V. Startsev, A. Ustyuzhanin, A. Kirillov, D. Baranchuk, and S. Kastryulin. Alchemist: Turning public text-to-image data into generative gold. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025.

  58. [58] R. Steed and A. Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 701–713, 2021.

  59. [59] L. Struppek, D. Hintersdorf, F. Friedrich, P. Schramowski, and K. Kersting. Exploiting cultural biases via homoglyphs in text-to-image synthesis. Journal of Artificial Intelligence Research (JAIR), 2023.

  60. [60] S. Tedeschi, F. Friedrich, P. Schramowski, K. Kersting, R. Navigli, H. Nguyen, and B. Li. ALERT: A comprehensive benchmark for assessing large language models’ safety through red teaming. In Workshop on Red Teaming Generative AI Models, 2024.

  61. [61] B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  62. [62] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y. Cheng, S. Koyejo, D. Song, and B. Li. Decodingtrust: A comprehensive assessment of trustworthiness in GPT models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benc...

  63. [63] C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. ming Yin, S. Bai, X. Xu, Y. Chen, Y. Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, D. Liu, D. Li, H. Zhang, H. Meng, H. Wei, J. Ni, K. Chen, K. Cao, L. Peng, L. Qu, M. Wu, P. Wang, S. Yu, T. Wen, W. Feng, X. Xu, Y. Wang, Y. Zhang, Y. Zhu, Y. Wu, Y. Cai, and Z. Liu. Qwen-image technical report,

  64. [64] URL https://arxiv.org/abs/2508.02324

  65. [65] J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.

  66. [66] Y. Yang, B. Hui, H. Yuan, N. Gong, and Y. Cao. SneakyPrompt: Jailbreaking text-to-image generative models. In Proceedings of the IEEE Symposium on Security and Privacy (S&P), 2024.

  67. [67] W. Zeng, D. Kurniawan, R. Mullins, Y. Liu, T. Saha, D. Ike-Njoku, J. Gu, Y. Song, C. Xu, J. Zhou, et al. Shieldgemma 2: Robust and tractable image content moderation. arXiv preprint arXiv:2504.01081, 2025.

  68. [68] Y. Zeng, K. Klyman, A. Zhou, Y. Yang, M. Pan, R. Jia, D. Song, P. Liang, and B. Li. Ai risk categorization decoded (air 2024): From government regulations to corporate policies. arXiv preprint arXiv:2406.17864, 2024.

  69. [69] E. Zhang, K. Wang, X. Xu, Z. Wang, and H. Shi. Forget-me-not: Learning to forget in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.

  70. [70] L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, J. E. Gonzalez, I. Stoica, C. Barrett, and Y. Sheng. Sglang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

Appendix A Dose–response modeling

To summarize potential saturation in the dose–response...