No Safe Dose: How Training Data Drives Unsafe Image Generation

Felix Friedrich; Kristian Kersting; Lukas Helff; Niharika Hegde; Patrick Schramowski

arxiv: 2605.28137 · v1 · pith:OBTBAALAnew · submitted 2026-05-27 · 💻 cs.CV · cs.LG

Pith reviewed 2026-06-29 13:38 UTC · model grok-4.3

keywords: text-to-image models, unsafe content, training data composition, model safety, data curation, text encoder, safety classifiers

The pith

The proportion of unsafe images in training data directly raises the rate of unsafe outputs from text-to-image models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates the effect of unsafe content in training data by building otherwise identical text-to-image models on datasets that vary only in the fraction of unsafe images, from zero to 9.6 percent. Output unsafety measured by four independent classifiers rises steadily with that fraction, and the proportion of unsafe images matters more than their absolute number. Even with no unsafe images in training, a baseline of 16.6 percent unsafe outputs remains, which drops when a safer text encoder is substituted. Safety filtering of the data produces no measurable drop in standard quality scores such as FID or CLIPscore. These controlled results indicate that data curation and text-encoder safety act as separate levers.

Core claim

Training the same text-to-image architecture on datasets that differ solely in the fraction of unsafe images produces a monotonic rise in unsafe model outputs, from 16.6 percent at zero contamination to 25.5 percent at five percent contamination; the operative variable is the proportion rather than the absolute count of unsafe training images, while a residual baseline risk persists even at zero contamination and is partly traceable to the frozen text encoder.

What carries the argument

The controlled dose-response relationship between the proportion of unsafe training images and measured output unsafety, isolated via factorial dataset construction.

If this is right

Safety filtering of training data lowers output unsafety without harming FID, CLIPscore, or ImageReward.
Swapping the text encoder for a safer variant reduces the zero-contamination baseline from 16.6 percent to 9.6 percent.
Data curation and text-encoder safety function as independent, additive interventions.
The proportion of unsafe images, rather than their total count, governs the safety outcome across dataset scales from 100K to 8M.
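To make the proportion-versus-count distinction concrete, here is a minimal Python sketch of how a factorial grid can decouple the two variables. The grid values and the helper name `factorial_cells` are illustrative assumptions, not the paper's actual conditions.

```python
from itertools import product

def factorial_cells(sizes, proportions):
    """Cross dataset sizes with unsafe proportions; one cell per (N, p) pair."""
    cells = []
    for n, p in product(sizes, proportions):
        # U is the absolute number of unsafe images implied by N and p.
        cells.append({"N": n, "p": p, "U": round(n * p)})
    return cells

# Hypothetical grid chosen so that two cells share the same unsafe count.
sizes = [100_000, 1_000_000]
proportions = [0.005, 0.05]
cells = factorial_cells(sizes, proportions)

# Same proportion, different counts: varies only the absolute count.
same_p = [c for c in cells if c["p"] == 0.05]
# Same count, different proportions: varies only the proportion.
same_u = [c for c in cells if c["U"] == 5_000]

for c in cells:
    print(c)
```

If output unsafety tracks `p` within `same_u` but stays flat within `same_p`, that pattern would point to proportion as the operative variable, which is the contrast the summary above attributes to the paper.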

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same proportion-driven effect may appear in other generative modalities once comparable controlled datasets become available.
If model capabilities continue to grow, the residual baseline unsafety could interact with new compositional behaviors in ways the current experiments do not test.
Repeated safety filtering at both data and encoder stages might drive the floor still lower, but that combined regime lies outside the reported design.

Load-bearing premise

The datasets differ only in the fraction of unsafe images and the four safety classifiers give an unbiased reading of true output unsafety.

What would settle it

Generating images from models trained on increasing proportions of unsafe data and finding no corresponding rise in the fraction flagged unsafe by the classifiers would falsify the central claim.
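As a rough illustration of what such a check could look like, the following Python sketch tests whether the unsafe output rate rises strictly with the training proportion. The four points are the rate-controlled conditions at N ≈ 8M listed alongside Figure 2; the helper name `is_strictly_increasing` is an assumption made for this example, and the check ignores sampling noise.

```python
# (unsafe training proportion in %, unsafe output rate in %), from the
# conditions listed alongside Figure 2 (C1, C0, C2, C3).
points = [(0.00, 16.6), (1.21, 20.6), (5.00, 25.5), (9.60, 26.4)]

def is_strictly_increasing(pairs):
    """Return True if the output rate strictly rises as the proportion rises."""
    ordered = sorted(pairs)  # sort by training proportion
    rates = [q for _, q in ordered]
    return all(a < b for a, b in zip(rates, rates[1:]))

monotone = is_strictly_increasing(points)
print("monotone dose-response:", monotone)
```

A point-estimate check like this is only a first pass; a falsification attempt would also need uncertainty estimates per condition, which the text reproduced here does not report.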

Figures

Figures reproduced from arXiv: 2605.28137 by Felix Friedrich, Kristian Kersting, Lukas Helff, Niharika Hegde, Patrick Schramowski.

**Figure 1.** Models become unsafer over time. Unsafe generation rates rise across successive T2I model generations, with certain harm categories showing steeper increases.

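**Figure 2.** Percentage of unsafe model outputs as a function of unsafe training data proportion. Clear monotonic relationship; circle size corresponds to training data size.

Table captured with the caption (column labels as in the source; the extraction is truncated after the scale-sweep group header):

| ID | Name | N | p | U | q | ∆q |
|----|------|---|---|---|---|----|
| Original/Reference | | | | | | |
| C0 | 8M-1% | 7.94M | 1.21 | 96K | 20.6 | n/a |
| Rate-controlled (p), fixed scale (N ≈ 8M) | | | | | | |
| C1 | 8M-0% | 7.94M | 0.00 | 0 | 16.6 | -4.0 |
| C2 | 8M-5% | 8.24M | 5.00 | 412K | 25.5 | +4.9 |
| C3 | 8M-10% | 8.64M | 9.60 | 829K | 26.4 | +5.8 |
| Rate-controlled (p), scale sweep … | | | | | | |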
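The values captured with the Figure 2 caption can be sanity-checked with a little arithmetic. The Python sketch below assumes that U is the unsafe image count implied by N and p (in percent) and that ∆q is the difference in q relative to the reference condition C0; both readings are inferred from the numbers rather than stated in the text reproduced here, and the helper names are hypothetical.

```python
# Rows as captured with the Figure 2 caption (N and U expanded from M/K).
rows = {
    "C0": {"N": 7.94e6, "p": 1.21, "U": 96_000, "q": 20.6},
    "C1": {"N": 7.94e6, "p": 0.00, "U": 0, "q": 16.6},
    "C2": {"N": 8.24e6, "p": 5.00, "U": 412_000, "q": 25.5},
    "C3": {"N": 8.64e6, "p": 9.60, "U": 829_000, "q": 26.4},
}

def implied_unsafe_count(n, p_percent):
    """Unsafe image count implied by dataset size and unsafe percentage."""
    return n * p_percent / 100.0

def delta_q(row, reference):
    """Difference in unsafe output rate relative to the reference row."""
    return round(row["q"] - reference["q"], 1)

for name, row in rows.items():
    implied = implied_unsafe_count(row["N"], row["p"])
    print(name, round(implied), row["U"], delta_q(row, rows["C0"]))
```

Under these assumptions the implied counts agree with the listed U values to within rounding, and the recomputed differences match the listed ∆q column.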

**Figure 3.** Cross-classifier unsafe rates (%) across seven training contamination conditions. Despite approx. 2× differences in absolute rates (due to different policies, coverage, strictness), all classifiers trace the same per-condition profile, illustrating the effect is independent of the specific classifier.

[Plot labels captured with the caption: x-axis "Safety Category" (O1–O9); legend entries by training condition, including 0% unsafe (7.94M), 5% unsafe (8.24M), and 1.21% unsafe (7.94M, origin…]

**Figure 4.** Category composition of unsafe outputs. Columns show fraction of unsafe outputs per safety category (O1–O9). O3 and O4 show the strongest sensitivity to training contamination.

Adjacent paper text captured with the caption: "… higher rates to C0 and C2. This cross-classifier consistency substantially strengthens our findings, as it is unlikely that four independently trained models with different architectures, training data, taxonomies, etc. would all ex…"

**Figure 5.** Model scale ablation. Unsafe output rate (%) for 1.2B and 3.6B models on C1 (0%) and C0 (1.2%) training data conditions.

| Params | C1 (0%) | C0 (1.2%) |
|--------|---------|-----------|
| 1.2B | 16.6 | 20.6 |
| 3.6B | 16.3 | 19.7 |

Adjacent paper text captured with the caption: "Impact of model scale. While scaling laws for diffusion transformers [33] suggest that performance trends at smaller scales generalize to larger models, we explicitly ablate the influence of model capacity on safety behavior. To thi…"

**Figure 6.** Training loss convergence. (a) MSE loss (2K-step rolling average) for all seven conditions over 100K training steps. All conditions converge rapidly and plateau after ∼50K steps. (b) Zoomed view of the last 50K steps confirming convergence: loss improvement is less than 2% (noise) in the final 20K steps across all conditions.

**Figure 7.** Cross-classifier agreement on training data annotations. Agreement rates and Cohen’s κ between LlavaGuard (primary annotator) and three alternative safety classifiers on a shared subset of training images.
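For readers unfamiliar with the agreement statistics named in the Figure 7 caption, the Python sketch below computes a raw agreement rate and Cohen's κ for two binary annotators. The toy label lists are hypothetical and the function name `agreement_and_kappa` is an assumption; this illustrates the standard definitions, not the authors' evaluation code.

```python
def agreement_and_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for binary labels (1 = unsafe)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal rate of "unsafe".
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both annotators use a single identical label; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)

# Hypothetical labels from a primary annotator and one alternative classifier.
primary = [1, 1, 0, 0]
alternative = [1, 0, 0, 0]
observed, kappa = agreement_and_kappa(primary, alternative)
print(f"agreement={observed:.2f}, kappa={kappa:.2f}")
```

Reporting κ next to raw agreement matters here because unsafe images are a small minority of any training subset, so two classifiers can agree often by chance alone.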

Original abstract

Text-to-image models trained on large-scale data almost inevitably ingest unsafe content. While prior work has observed input-output amplification, it remains unclear whether and how model output safety is driven directly by training data composition or by other factors. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ only in their fraction of unsafe images (0% to 9.6%), across several dataset scales (100K to 8M). We then generate images with the resulting models and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute count, of unsafe training images is the operative variable. The irreducible 16.6% baseline at zero contamination implicates other components, e.g. the frozen text encoder, as a residual safety risk. This is confirmed by a text encoder ablation showing that SafeCLIP reduces this floor to 9.6%, while the dose-response effect persists across all three encoders tested. Critically, no quality degradation in terms of FID, CLIPscore, and ImageReward accompanies safety filtering. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety raises questions for future research about emerging capabilities and compositionality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows unsafe output rates rise with the proportion of unsafe training images rather than their absolute count, plus a baseline risk that a safer encoder can lower.

The central result here is that training the same model on datasets varying only in unsafe image fraction produces a clear monotonic increase in unsafe generations, from 16.6% at zero contamination up to 25.5% at 5%, and that proportion drives the effect more than raw count. The encoder ablation is a useful addition because it shows the dose-response survives while the floor drops when swapping in SafeCLIP.

They did the controlled part right by holding model architecture fixed and testing across multiple dataset scales from 100K to 8M. Using four separate classifiers and checking that FID, CLIPscore, and ImageReward stay flat is also sensible; it rules out the obvious quality trade-off excuse. The factorial design separating proportion from count is the part that actually moves the needle beyond prior observations.

The main gap is still the dataset construction. The abstract states the sets differ only in unsafe fraction, but without the exact filtering procedure, caption matching, or checks that visual statistics and quality stayed balanced, it's hard to know whether other variables moved along with the unsafe count. The classifier outputs also need more grounding; inter-rater agreement or a small human validation set would make the 16.6–25.5% numbers more credible. If those checks are solid in the full paper, the claim holds; if not, the monotonic pattern could be partly artifactual.

This is worth sending to referees who work on data curation and safety for generative models. The question is practical and the design is straightforward, so it merits review even if the methods section will need tightening.

Referee Report

3 major / 1 minor

Summary. The paper claims that training text-to-image models on datasets differing only in the fraction of unsafe images (0% to 9.6%) at scales from 100K to 8M produces a monotonic rise in output unsafety (measured by four classifiers) from 16.6% at zero contamination to 25.5% at 5%. A factorial design isolates proportion (not absolute count) as the operative variable. A 16.6% baseline at zero contamination is attributed to other components such as the frozen text encoder; an ablation with SafeCLIP lowers this floor to 9.6% while the dose-response persists. Safety filtering incurs no measurable degradation in FID, CLIPscore, or ImageReward, implying data curation and text-encoder safety are complementary interventions.

Significance. If the empirical results hold after verification of dataset construction and classifier validity, the work supplies a controlled demonstration that unsafe training proportion directly drives output unsafety, with the factorial design and text-encoder ablation providing evidence that proportion is causal and that residual risks arise from other model components. The absence of quality trade-offs strengthens the practical implication that curation is an effective, low-cost intervention complementary to encoder-level fixes.

major comments (3)

[Methods] Methods (dataset construction and factorial design): The claim that datasets 'differ only in their fraction of unsafe images' is load-bearing for the monotonic dose-response and 'proportion, not count' conclusion, yet the manuscript supplies no description of the curation procedure, caption generation, visual distribution matching, or controls for other statistics. Without these details, alternative explanations (e.g., correlated changes in caption style or image quality) cannot be excluded.
[Results] Results (safety classifier evaluation): The reported rates (16.6% to 25.5%) and monotonic relationship rest on four independent classifiers, but no inter-classifier agreement statistics, calibration curves, or human validation against perceived unsafety are provided. This measurement gap directly affects the reliability of the baseline, the dose-response, and the text-encoder ablation results.
[Ablations] Ablations (text encoder): The SafeCLIP ablation is cited to confirm the 16.6% floor arises from the text encoder, yet the manuscript does not report the precise experimental factors in the factorial design, sample sizes per cell, or any statistical test for the proportion-vs-count contrast. These omissions prevent assessment of whether the design isolates the claimed variable.

minor comments (1)

[Abstract] The abstract states results across 'several dataset scales (100K to 8M)' but does not include a table or figure breaking down unsafety rates by scale; adding this would clarify whether the proportion effect is scale-invariant.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important gaps in methodological transparency and validation that we will address in revision. Below we respond point-by-point to the major comments.

Referee: [Methods] Methods (dataset construction and factorial design): The claim that datasets 'differ only in their fraction of unsafe images' is load-bearing for the monotonic dose-response and 'proportion, not count' conclusion, yet the manuscript supplies no description of the curation procedure, caption generation, visual distribution matching, or controls for other statistics. Without these details, alternative explanations (e.g., correlated changes in caption style or image quality) cannot be excluded.

Authors: We agree that the current manuscript does not provide sufficient detail on dataset construction to fully substantiate the claim that the datasets differ solely in unsafe-image fraction. Although the source datasets and filtering criteria are referenced, explicit descriptions of the curation pipeline, caption generation method, visual-distribution matching steps, and controls for confounding statistics (image quality, caption style, etc.) are missing. In the revised manuscript we will add a dedicated subsection in Methods that documents these procedures and the controls employed. revision: yes
Referee: [Results] Results (safety classifier evaluation): The reported rates (16.6% to 25.5%) and monotonic relationship rest on four independent classifiers, but no inter-classifier agreement statistics, calibration curves, or human validation against perceived unsafety are provided. This measurement gap directly affects the reliability of the baseline, the dose-response, and the text-encoder ablation results.

Authors: We acknowledge that the reliability of the four safety classifiers requires additional quantitative support. We will add inter-classifier agreement statistics (pairwise agreement rates and Cohen’s kappa) and any available calibration information to the revised Results section. Human validation against perceived unsafety was not performed in the original study; we will therefore note this as a limitation and reference the classifiers’ prior validation in the literature rather than claim new human-grounded evidence. revision: partial
Referee: [Ablations] Ablations (text encoder): The SafeCLIP ablation is cited to confirm the 16.6% floor arises from the text encoder, yet the manuscript does not report the precise experimental factors in the factorial design, sample sizes per cell, or any statistical test for the proportion-vs-count contrast. These omissions prevent assessment of whether the design isolates the claimed variable.

Authors: We agree that the factorial design must be described with greater precision. The revised manuscript will explicitly list the experimental factors, the number of samples per cell, and the statistical tests (including any regression or ANOVA results) used to demonstrate that proportion, rather than absolute count, drives the observed effect. revision: yes
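As an example of the kind of test the referee asks for, the Python sketch below runs a two-sided two-proportion z-test on the unsafe output rates of two conditions. The rates 16.6% and 20.6% are the C1 and C0 values quoted on this page; the per-condition sample size of 10,000 generated images is purely hypothetical, since the number of evaluated images is not given here, and the function name is an assumption.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for a difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

n = 10_000  # hypothetical number of generated images per condition
z, p_value = two_proportion_z(0.206, n, 0.166, n)
print(f"z={z:.2f}, p={p_value:.2e}")
```

The same test applied to two cells that share an unsafe count but differ in proportion (or the reverse) would address the proportion-versus-count contrast directly; a regression or ANOVA over all cells, as the authors propose, would use the full design rather than one pair.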

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential reductions

The paper reports direct experimental outcomes from training identical models on datasets that differ only in unsafe image fraction (0% to 9.6%), then measuring generated image unsafety via four classifiers. No equations, fitted parameters, or derivations appear in the supplied text. Claims such as the monotonic rise from 16.6% to 25.5% and the proportion-vs-count factorial result are presented as observed data points, not quantities defined in terms of themselves or reduced via self-citation. The text-encoder ablation is likewise an independent experimental contrast. Because the central results rest on external measurement rather than any internal definitional loop, the derivation chain (such as it is) is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions: that safety classifiers accurately quantify unsafety and that the constructed datasets isolate the unsafe fraction variable with no other differences.

axioms (2)

Domain assumption: Safety classifiers accurately measure image unsafety. Evaluation of generated images relies on four independent safety classifiers to determine output unsafety rates.
Domain assumption: Datasets differ only in the fraction of unsafe images. The experimental isolation of the proportion variable assumes all other dataset properties are held constant across conditions.

pith-pipeline@v0.9.1-grok · 5804 in / 1406 out tokens · 31006 ms · 2026-06-29T13:38:08.325571+00:00 · methodology

Reference graph

Works this paper leans on

70 extracted references · 14 canonical work pages · 5 internal anchors

[1] A. Birhane and V. U. Prabhu. Large image datasets: A pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536–1546. IEEE, 2021.
[2] A. Birhane, V. U. Prabhu, and E. Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021.
[3] A. Birhane, S. Han, V. Boddeti, S. Luccioni, et al. Into the laion’s den: Investigating hate in multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.
[4] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine. Training diffusion models with reinforcement learning. In International Conference on Learning Representations (ICLR), 2024.
[5] M. Brack, F. Friedrich, D. Hintersdorf, L. Struppek, P. Schramowski, and K. Kersting. SEGA: Instructing text-to-image models using semantic guidance. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
[6] M. Brack, F. Friedrich, K. Kornmeier, L. Tsaban, P. Schramowski, K. Kersting, and A. Passos. LEdits++: Limitless image editing using text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[7] M. Brack, S. Katakol, F. Friedrich, P. Schramowski, H. Ravi, K. Kersting, and A. Kale. How to train your text-to-image model: Evaluating design choices for synthetic training captions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, October 2025.
[8] N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramer, B. Balle, D. Ippolito, and E. Wallace. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 5253–5270, 2023.
[9] CompVis. Stable diffusion safety checker. https://huggingface.co/CompVis/stable-diffusion-safety-checker, 2022. CLIP-based NSFW concept classifier shipped with Stable Diffusion.
[10] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorber, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning (ICML), 2024.
[11] European Commission. AI Act: Regulatory Framework on Artificial Intelligence. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai, 2024. Regulation (EU) 2024/1689. Accessed: 2026-05-19.
[12] F. Friedrich, W. Stammer, P. Schramowski, and K. Kersting. Revision transformers: Instructing language models to change their values. In European Conference on Artificial Intelligence (ECAI), 2023.
[13] F. Friedrich, M. Brack, L. Struppek, D. Hintersdorf, P. Schramowski, S. Luccioni, and K. Kersting. Auditing and instructing text-to-image generation models on fairness. AI and Ethics, 2024.
[14] F. Friedrich, S. Tedeschi, P. Schramowski, M. Brack, R. Navigli, H. Nguyen, B. Li, and K. Kersting. LLMs lost in translation: M-ALERT uncovers cross-linguistic safety inconsistencies. In ICLR Workshop on Building Trust in Language Models and Applications, 2025.
[15] F. Friedrich, T. G. Welsch, M. Brack, et al. Beyond overcorrection: Evaluating diversity in T2I models with DivBench. arXiv preprint arXiv:2507.03015, 2025.
[16] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.
[17] R. Gandikota, J. Materzyńska, J. Fiotto-Kaufman, and D. Bau. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[18] R. Gandikota, H. Orgad, Y. Belinkov, J. Materzyńska, and D. Bau. Unified concept editing in diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024.
[19] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
[20] Gemma Team, M. Riviere, S. Pathak, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024. URL https://arxiv.org/abs/2408.00118.
[21] S. Ghosh, H. Frase, A. Williams, S. Luger, P. Röttger, F. Barez, S. McGregor, et al. MLCommons AILuminate: Introducing v1.0 of the AI risk and reliability benchmark. arXiv preprint arXiv:2503.05731, 2025.
[22] Google. Nano banana (gemini 2.5 flash image): Multimodal image generation and editing. https://www.digitalocean.com/resources/articles/nano-banana, 2025. AI image generation and editing model within the Gemini 2.5 Flash system.
[23] M. Hall, L. van der Maaten, L. Gustafson, M. Jones, and A. Adcock. A systematic study of bias amplification. arXiv preprint arXiv:2201.11706, 2022.
[24] R. Härle, F. Friedrich, M. Brack, S. Wäldchen, B. Deiseroth, P. Schramowski, and K. Kersting. Measuring and guiding monosemanticity. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
[25] L. Helff, F. Friedrich, M. Brack, K. Kersting, and P. Schramowski. LlavaGuard: An open VLM-based framework for safeguarding vision datasets and models. In International Conference on Machine Learning (ICML), 2025.
[26] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[27] D. Hintersdorf, L. Struppek, M. Brack, F. Friedrich, P. Schramowski, and K. Kersting. Does CLIP know my face? Journal of Artificial Intelligence Research (JAIR), 2024.
[28] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.
[29] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[30] N. Kumari, B. Zhang, S.-Y. Wang, E. Shechtman, R. Zhang, and J.-Y. Zhu. Ablating concepts in text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[31] K. Lee, H. Liu, M. Ryu, O. Watkins, Y. Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
[32] G. Li, K. Chen, S. Zhang, J. Zhang, and T. Zhang. Art: Automatic red-teaming for text-to-image models to protect benign users. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
[33] L. Li, C. Chen, R. Qian, W. Hu, T.-J. Fu, J. Tong, X. Wang, B. Zhang, A. Schwing, W. Liu, and Y. Yang. Dit-air: Revisiting the efficiency of diffusion model architecture design in text to image generation. arXiv preprint arXiv:2503.10618, 2025.
[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, pages 740–755. Springer, 2014.
[35] S. Lu, Z. Wang, L. Li, Y. Liu, and A. W.-K. Kong. MACE: Mass concept erasure in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[36] A. S. Luccioni, C. Akiki, M. Mitchell, and Y. Jernite. Stable bias: Evaluating societal representations in diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023.
[37] Midjourney, Inc. Midjourney: Ai-based image generation system. https://www.midjourney.com, 2025. Text-to-image model known for stylized and high-quality visual generation.
[38] M. Mundt, A. Ovalle, F. Friedrich, A. Pranav, S. Paul, et al. The cake that is intelligence and who gets to bake it: An AI analogy and its implications for participation. arXiv preprint arXiv:2502.03038, 2025.
[39] T. Nakamura, M. Mishra, S. Tedeschi, Y. Chai, J. T. Stillerman, F. Friedrich, et al. Aurora-M: Open source continual pre-training for multilingual language and code. In International Conference on Computational Linguistics (COLING) Industry Track, 2025.
[40] Photoroom. Prx: Text-to-image generation via rectified flow transformer. HuggingFace blog. Available at https://huggingface.co/Photoroom/prx-1024-t2i-beta
[42] S. Poppi, T. Poppi, F. Cocchi, M. Cornia, L. Baraldi, and R. Cucchiara. Safe-clip: Removing nsfw concepts from vision-and-language models. In European Conference on Computer Vision, pages 340–356. Springer, 2024.
[43] Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y. Zhang. Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models. In Proceedings of the 2023 ACM SIGSAC conference on computer and communications security, pages 3403–3417, 2023.
[44] J. Quaye, A. Parrish, O. Inel, C. Rastogi, H. R. Kirk, M. Kahng, E. Van Liemt, M. Bartolo, J. Tsang, J. White, et al. Adversarial nibbler: An open red-teaming method for identifying diverse harms in text-to-image generation. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 388–406, 2024.
[45] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[46] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[47] J. Rando, D. Paleka, D. Lindner, L. Heim, and F. Tramèr. Red-teaming the stable diffusion safety filter. In NeurIPS ML Safety Workshop, 2022.
[48] A. Reuel, A. Ghosh, J. Chim, A. Tran, Y. Long, J. Mickel, et al. Who evaluates AI’s social impacts? mapping coverage and gaps in first and third party evaluations. In International Conference on Machine Learning (ICML), 2026.
[49] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[50] P. Röttger, G. Attanasio, F. Friedrich, J. Goldzycher, et al. MSTS: A multimodal safety test suite for vision-language models. arXiv preprint arXiv:2501.10057, 2025.
[51] P. Schramowski, C. Tauchmann, and K. Kersting. Can machines help us answering question 16 in datasheets, and in turn reflecting on inappropriate content? In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1350–1361, 2022.
[52] P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22522–22531, 2023.
[53]

Schuhmann, R

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in Neural Information Processing Systems, 35: 25278–25294, 2022

2022
[54]

Seshadri, S

P. Seshadri, S. Singh, and Y . Elazar. The bias amplification paradox in text-to-image generation. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2024

2024
[55]

Solaiman, Z

I. Solaiman, Z. Talat, W. Agnew, L. Ahmad, D. Baker, S. L. Blodgett, C. Chen, H. Daumé III, J. Dodge, I. Duan, et al. Evaluating the social impact of generative AI systems in systems and society. InOxford Handbook on the Foundations and Regulation of Generative AI, 2023

2023
[56]

Somepalli, V

G. Somepalli, V . Singla, M. Goldblum, J. Geiping, and T. Goldstein. Diffusion art or digital forgery? investigating data replication in diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023
[57]

V. Startsev, A. Ustyuzhanin, A. Kirillov, D. Baranchuk, and S. Kastryulin. Alchemist: Turning public text-to-image data into generative gold. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025
[58]

R. Steed and A. Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 701–713, 2021
[59]

L. Struppek, D. Hintersdorf, F. Friedrich, P. Schramowski, and K. Kersting. Exploiting cultural biases via homoglyphs in text-to-image synthesis. Journal of Artificial Intelligence Research (JAIR), 2023
[60]

S. Tedeschi, F. Friedrich, P. Schramowski, K. Kersting, R. Navigli, H. Nguyen, and B. Li. ALERT: A comprehensive benchmark for assessing large language models’ safety through red teaming. In Workshop on Red Teaming Generative AI Models, 2024
[61]

B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[62]

B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y. Cheng, S. Koyejo, D. Song, and B. Li. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benc...
[63]

C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. ming Yin, S. Bai, X. Xu, Y. Chen, Y. Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, D. Liu, D. Li, H. Zhang, H. Meng, H. Wei, J. Ni, K. Chen, K. Cao, L. Peng, L. Qu, M. Wu, P. Wang, S. Yu, T. Wen, W. Feng, X. Xu, Y. Wang, Y. Zhang, Y. Zhu, Y. Wu, Y. Cai, and Z. Liu. Qwen-Image technical report,
[64]

URL https://arxiv.org/abs/2508.02324

[65]

J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024
[66]

Y. Yang, B. Hui, H. Yuan, N. Gong, and Y. Cao. SneakyPrompt: Jailbreaking text-to-image generative models. In Proceedings of the IEEE Symposium on Security and Privacy (S&P), 2024
[67]

W. Zeng, D. Kurniawan, R. Mullins, Y. Liu, T. Saha, D. Ike-Njoku, J. Gu, Y. Song, C. Xu, J. Zhou, et al. ShieldGemma 2: Robust and tractable image content moderation. arXiv preprint arXiv:2504.01081, 2025
[68]

Y. Zeng, K. Klyman, A. Zhou, Y. Yang, M. Pan, R. Jia, D. Song, P. Liang, and B. Li. AI risk categorization decoded (AIR 2024): From government regulations to corporate policies. arXiv preprint arXiv:2406.17864, 2024
[69]

E. Zhang, K. Wang, X. Xu, Z. Wang, and H. Shi. Forget-me-not: Learning to forget in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024
[70]

L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, J. E. Gonzalez, I. Stoica, C. Barrett, and Y. Sheng. SGLang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems (NeurIPS), 2024

Appendix A Dose–response modeling

To summarize potential saturation in the dose–response...
