T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models

Aditya Singh; Anchal Chaurasiya; Ankush Kumar; Gyanendra Chaubey; Nihal Jaiswal; Siddhartha Arjaria

arxiv: 2604.12481 · v1 · submitted 2026-04-14 · 💻 cs.CV

Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3

keywords text-to-image models, bias evaluation, demographic bias, cultural bias, diffusion models, generative AI, evaluation framework, element omission

The pith

T2I-BiasBench supplies thirteen metrics that jointly measure demographic bias, element omission, and cultural collapse in text-to-image diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents T2I-BiasBench as the first unified framework to evaluate three bias dimensions at once in generative models. It combines six existing metrics with seven new or adapted ones, including four original scores for composite bias, missing elements, and cultural accuracy. Tests on Stable Diffusion v1.5, BK-SDM, Koala Lightning, and an RLHF-aligned model across 1,574 images show amplification of beauty-related biases, reduced gender bias under added context, and narrow cultural outputs even in aligned systems. Readers would care because these models now generate public content at scale, so unmeasured biases can embed stereotypes into everyday visual media. The benchmark is released publicly to support consistent auditing.

Core claim

T2I-BiasBench is a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models. Evaluation of Stable Diffusion v1.5, BK-SDM Base, Koala Lightning, and Gemini 2.5 Flash on 1,574 images from five prompt categories reveals bias amplification greater than 1.0 in beauty prompts, substantial attenuation of professional-role gender bias when contextual constraints such as surgical PPE are added, and cultural collapse to narrow representations (CAS scores 0.54-1.00) across all models including the RLHF-aligned baseline.

What carries the argument

T2I-BiasBench, the collection of thirteen metrics that integrates six established measures with four newly proposed ones (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) plus three adapted scores to quantify the three bias types simultaneously.

If this is right

Stable Diffusion v1.5 and BK-SDM exhibit bias amplification above 1.0 on beauty-related prompts.
Contextual constraints such as surgical PPE reduce professional-role gender bias to low values such as CBS of 0.06.
All tested models, including the RLHF-aligned Gemini, produce cultural accuracy ratios showing collapse to narrow representations.
The public benchmark enables standardized fine-grained evaluation for comparing future models or training interventions.
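
The paper's amplification formula is not reproduced in this review. As a minimal sketch, assuming amplification is read as the ratio of a group's share among generated images to its share in some reference distribution, a value above 1.0 would mean the model over-represents that group. The function name, the group labels, and the 0.6 reference share are hypothetical.

```python
from collections import Counter


def amplification_ratio(generated_labels, reference_share, group):
    """Share of `group` among generated images divided by its reference share.

    Assumed reading of "amplification > 1.0"; not the paper's definition.
    """
    if not generated_labels:
        raise ValueError("generated_labels must not be empty")
    if not 0 < reference_share <= 1:
        raise ValueError("reference_share must be in (0, 1]")
    observed_share = Counter(generated_labels)[group] / len(generated_labels)
    return observed_share / reference_share


# Hypothetical labels for ten generated images of one prompt.
labels = ["group_a"] * 9 + ["group_b"]
ratio = amplification_ratio(labels, reference_share=0.6, group="group_a")
print(round(ratio, 2))  # 1.5
```

Under this reading, a ratio of exactly 1.0 means the generated share matches the reference share, so the choice of reference distribution determines what counts as amplification.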

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The finding that alignment leaves cultural coverage gaps suggests training data curation must target source diversity rather than post-training fixes alone.
Developers could incorporate the new missing-rate and cultural-accuracy metrics as direct penalties during fine-tuning to test whether coverage improves.
The same metric structure might be applied to video or 3D generation to check whether omission and collapse patterns transfer across modalities.

Load-bearing premise

The thirteen chosen metrics, including the four new ones, provide an accurate non-circular measure of demographic bias, element omission, and cultural collapse without requiring subjective human judgment.

What would settle it

Independent human raters assigning substantially different bias rankings to the same set of 1,574 generated images than the framework's Composite Bias Score and Cultural Accuracy Ratio would indicate the metrics do not track the intended dimensions.
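
One way to run such a check is a rank correlation between the framework's per-prompt scores and human ratings of the same prompts. The sketch below computes Spearman's rho in plain Python, using average ranks for ties. The scores and ratings are made-up placeholders, not values from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: metric scores and human bias ratings for four prompts.
metric_scores = [0.1, 0.4, 0.7, 0.9]
human_ratings = [1, 2, 4, 3]
rho = spearman_rho(metric_scores, human_ratings)
print(round(rho, 2))  # 0.8
```

A rho near 1 would mean the metric orders prompts the way human raters do; a value near 0 or below would support the concern stated above.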

Original abstract

Text-to-image (T2I) generative models achieve impressive visual fidelity but inherit and amplify demographic imbalances and cultural biases embedded in training data. We introduce T2I-BiasBench, a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models - the first framework to address all three dimensions simultaneously. We evaluate three open-source models - Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning - against Gemini 2.5 Flash (RLHF-aligned) as a reference baseline. The benchmark comprises 1,574 generated images across five structured prompt categories. T2I-BiasBench integrates six established metrics with seven additional measures: four newly proposed (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) and three adapted (Hallucination Score, Vendi Score, CLIP Proxy Score). Three key findings emerge: (1) Stable Diffusion v1.5 and BK-SDM exhibit bias amplification (>1.0) in beauty-related prompts; (2) contextual constraints such as surgical PPE substantially attenuate professional-role gender bias (Doctor CBS = 0.06 for SD v1.5); and (3) all models, including RLHF-aligned Gemini, collapse to a narrow set of cultural representations (CAS: 0.54-1.00), confirming that alignment techniques do not resolve cultural coverage gaps. T2I-BiasBench is publicly released to support standardized, fine-grained bias evaluation of generative models. The project page is available at: https://gyanendrachaubey.github.io/T2I-BiasBench/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

T2I-BiasBench puts forward a multi-metric audit for demographic, omission, and cultural bias in T2I models, but the new scores need explicit definitions and validation to confirm they are independent.

T2I-BiasBench combines existing metrics with four new ones to measure demographic bias, element omission, and cultural collapse together in diffusion models. The authors test Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning against Gemini 2.5 Flash on 1,574 images from five prompt categories. They report bias amplification in beauty prompts for some models, lower gender bias when context like surgical PPE is added, and cultural collapse across all systems including the aligned baseline, with CAS scores ranging from 0.54 to 1.00.

The benchmark is released publicly with a project page, which is a practical step for anyone who wants to run the same checks on new models. The contextual finding on professional-role bias is concrete and easy to apply. The unification of the three dimensions is presented as new, and the paper does a reasonable job laying out the evaluation setup in the abstract.

The soft spots sit in the metric construction. The Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, and Cultural Accuracy Ratio are called complementary and newly proposed, yet the abstract supplies no equations, prompt templates, or inter-annotator results. Without those details it is difficult to check whether the scores avoid circularity, especially if they lean on CLIP embeddings already used by the evaluated models or thresholds tuned on the same images. No error bars or raw data appear in the summary, so the specific numbers like CBS at 0.06 are hard to assess for robustness.

This paper is aimed at developers and auditors who need a shared set of tests for fairness in generative models. Readers who work on bias evaluation or want to extend benchmarks will get usable material from the released framework and the structured prompt categories, even if they have to verify the new metrics themselves.

I would send it for peer review because the topic is current and the public release supports follow-up work, though the authors should expect direct questions on how the new scores are calculated and validated.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces T2I-BiasBench, a unified evaluation framework consisting of thirteen complementary metrics (six established and seven new or adapted) designed to jointly audit demographic bias, element omission, and cultural collapse in text-to-image diffusion models. It evaluates Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning against Gemini 2.5 Flash on 1,574 generated images from five structured prompt categories, reporting bias amplification (>1.0) in beauty-related prompts, attenuation of professional-role gender bias under contextual constraints (e.g., Doctor CBS = 0.06), and cultural collapse (CAS range 0.54-1.00) even in RLHF-aligned models. The benchmark is publicly released.

Significance. If the new metrics prove to be well-defined, independently validated, and non-circular, T2I-BiasBench would represent a meaningful advance by providing the first simultaneous treatment of the three bias dimensions in a single standardized suite, with the public release enabling reproducible auditing of generative models. The reported findings on context-dependent attenuation and persistent cultural collapse would be useful for guiding alignment research.

major comments (3)

[Abstract] The four newly proposed metrics (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) are described only at a high level with example output values but without explicit formulas, aggregation rules, or prompting templates. This prevents verification of whether the reported scores (e.g., CBS = 0.06, CAS 0.54-1.00, amplification >1.0) are derived directly from the metric definitions or depend on post-hoc thresholds or model-internal embeddings.
[Abstract] No details are supplied on human validation, inter-annotator agreement, or grounding procedures for the new metrics that rely on visual interpretation of generated images. Without such evidence, it is unclear whether the framework avoids subjective judgment, undermining the claim that the thirteen metrics jointly and non-circularly quantify the three bias dimensions.
[Abstract] The evaluation uses CLIP Proxy Score and other embedding-based measures on models that themselves rely on CLIP-style representations; the manuscript must demonstrate that the new composite scores remain independent of these embeddings rather than reducing to tautological re-measurement of the same signals.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment below with point-by-point responses and indicate planned revisions to the manuscript.

Referee: [Abstract] The four newly proposed metrics (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) are described only at a high level with example output values but without explicit formulas, aggregation rules, or prompting templates. This prevents verification of whether the reported scores (e.g., CBS = 0.06, CAS 0.54-1.00, amplification >1.0) are derived directly from the metric definitions or depend on post-hoc thresholds or model-internal embeddings.

Authors: We appreciate the referee drawing attention to the level of detail in the abstract. The full manuscript provides explicit formulas, aggregation rules, and prompting templates in Section 3.2 (Methodology) and Appendix B. For instance, the Composite Bias Score is defined as a normalized combination of demographic parity deviation and omission penalties, computed directly from detected attributes without post-hoc thresholds; the other metrics follow analogous rule-based definitions grounded in prompt elements. The reported values follow these definitions exactly. To improve accessibility, we will incorporate concise formula summaries into the abstract in the revised version.
revision: partial

Referee: [Abstract] No details are supplied on human validation, inter-annotator agreement, or grounding procedures for the new metrics that rely on visual interpretation of generated images. Without such evidence, it is unclear whether the framework avoids subjective judgment, undermining the claim that the thirteen metrics jointly and non-circularly quantify the three bias dimensions.

Authors: We agree that validation details are important for establishing reliability. The manuscript includes a human validation study in Section 4.3, where annotators assessed a subset of images for element presence and cultural attributes, with inter-annotator agreement reported via Fleiss' kappa. Grounding combines automated detection with human review for edge cases. We will expand the relevant section and appendix in the revision to provide fuller details on the annotation guidelines, sample size, and agreement metrics.
revision: yes

Referee: [Abstract] The evaluation uses CLIP Proxy Score and other embedding-based measures on models that themselves rely on CLIP-style representations; the manuscript must demonstrate that the new composite scores remain independent of these embeddings rather than reducing to tautological re-measurement of the same signals.

Authors: We acknowledge the referee's concern about potential circularity. The new metrics (CBS, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) are defined via explicit attribute detection and human-grounded rules rather than embedding similarity. The CLIP Proxy Score is included as one of the six established metrics, but Section 5 includes correlation analyses indicating that the composite scores capture distinct signals. We will add an explicit discussion subsection in the revised manuscript to further clarify this distinction and include any additional supporting ablations.
revision: partial
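
The simulated rebuttal mentions inter-annotator agreement reported via Fleiss' kappa. For readers unfamiliar with the statistic, here is a minimal sketch of the standard formula in plain Python. The rating table is a made-up example (three annotators judging whether a prompted element is present in each of four images), not data from the paper.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table of per-item category counts.

    table[i][j] is the number of raters who put item i in category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(table)
    if n_items == 0:
        raise ValueError("table must not be empty")
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: mean over items of the pairwise agreement rate.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(table[0])
    proportions = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts per image: [raters saying "present", raters saying "absent"].
ratings = [[3, 0], [0, 3], [3, 0], [0, 3]]
kappa = fleiss_kappa(ratings)
print(kappa)  # 1.0
```

Kappa is 1 under perfect agreement and falls to 0 or below when agreement is no better than chance, which is why the sample size and category balance of the annotated subset matter when interpreting a reported value.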

Circularity Check

0 steps flagged

No significant circularity; framework introduces independent metrics

The paper presents T2I-BiasBench as a collection of 13 metrics (6 established, 7 new or adapted) applied to 1,574 generated images from fixed models and prompts. No derivation chain, equations, or parameter-fitting steps are described that would reduce any reported score (e.g., CBS, CAS, amplification ratios) to a fitted input or self-referential definition. The three key findings are direct computations from the proposed metrics rather than predictions derived from the same data by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of prior results is used to justify the framework. The benchmark is therefore self-contained against external image generation and metric application.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the assumption that the chosen metrics faithfully represent the three bias types and that the evaluation prompts are representative; no external validation of the new metrics is provided in the abstract.

axioms (1)

domain assumption: The 13 metrics jointly and without overlap capture demographic bias, element omission, and cultural collapse.
Invoked when the authors state the framework addresses all three dimensions simultaneously.

invented entities (2)

Composite Bias Score (CBS): no independent evidence
purpose: Aggregate measure of bias amplification
Newly proposed metric used in the reported findings (Doctor CBS = 0.06)

Cultural Accuracy Ratio (CAS): no independent evidence
purpose: Measure of cultural representation collapse
Newly proposed metric used to report values 0.54-1.00 across models

