Pith · machine review for the scientific record

arxiv: 2605.10442 · v2 · submitted 2026-05-11 · 💻 cs.CY · cs.AI · cs.CL

Recognition: no theorem link

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:15 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL
keywords LLMs · stereotypes · social bias · multilingual evaluation · open-ended generation · harmfulness ratings · cultural adaptation

The pith

Every tested LLM produces harmful stereotypes in open-ended story generation, with the specific associations shifting according to the prompt language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work builds a large-scale multilingual dataset of stories generated by 23 LLMs in 10 languages to detect when models over-represent certain socio-demographic groups in particular roles or outcomes. Statistical analysis of more than 650,000 stories reveals over 1,500 associations that humans and the models themselves rate as harmful. These patterns appear in every model examined, cross provider boundaries, and adapt to the cultural context of the prompt language rather than appearing as fixed universal biases. The alignment between human and model harm ratings suggests the stereotypes are recognizable and consistent enough to be tracked systematically.

Core claim

StereoTales generates open-ended stories, annotates the protagonist's profile across 19 socio-demographic dimensions covering 79 attributes, and applies statistical tests to surface over-represented associations; more than 1,500 of these receive harm ratings from both a human panel and the LLMs themselves, showing that harmful stereotypes arise in every model, are largely shared across providers, and intensify against groups that are locally salient in the prompt language.

What carries the argument

The StereoTales evaluation pipeline, which combines large-scale open-ended story generation, automated socio-demographic annotation, statistical over-representation tests, and dual human-LLM harmfulness rating.
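
As a concreteness aid, here is a minimal Python sketch of how those four stages could be wired together. The function names, argument shapes, and the exact split into stages are illustrative assumptions, not the authors' implementation.

    from collections import Counter
    from itertools import product

    def run_pipeline(models, prompts, annotate, test_associations, rate_harm):
        """Hypothetical four-stage StereoTales-style pipeline.

        models            -- callables mapping (scenario, language, fixed attribute) to a story
        prompts           -- iterable of (language, fixed_attribute, scenario) tuples
        annotate          -- maps a story to a dict over the 19 socio-demographic dimensions
        test_associations -- statistical over-representation test on the co-occurrence counts
        rate_harm         -- harmfulness rating step (human panel and/or LLM judges)
        """
        counts = Counter()
        for model, (lang, fixed_attr, scenario) in product(models, prompts):
            story = model(scenario, lang, fixed_attr)          # 1. open-ended story generation
            profile = annotate(story)                          # 2. automated socio-demographic annotation
            for dim, value in profile.items():
                counts[(lang, fixed_attr, dim, value)] += 1    # accumulate co-occurrences
        flagged = test_associations(counts)                    # 3. over-representation tests
        return {assoc: rate_harm(assoc) for assoc in flagged}  # 4. dual human-LLM harm rating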

If this is right

  • Harmful stereotypes appear in open-ended generation for every model size and provider tested.
  • Stereotypes do not transfer as a fixed set but adapt to the cultural context of the prompt language.
  • Human and LLM harmfulness ratings align with Spearman correlation 0.62, with disagreements concentrated on particular attribute classes (a small alignment-check sketch follows this list).
  • More than 1,500 over-represented associations are identified across the 10-language dataset.
  • The associations are largely shared across different model providers rather than appearing as isolated failures.
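
The alignment figure above is a rank correlation over paired harm scores. A minimal sketch, assuming the released ratings can be read as per-association human and LLM scores; the labels and numbers below are invented for illustration, and the snippet does not reproduce the paper's 0.62.

    # Hypothetical re-check of human-vs-LLM harm-rating alignment on paired scores.
    from scipy.stats import spearmanr

    human_scores = {"assoc_a": 4.2, "assoc_b": 1.1, "assoc_c": 3.5, "assoc_d": 2.0}
    model_scores = {"assoc_a": 3.9, "assoc_b": 1.4, "assoc_c": 3.8, "assoc_d": 2.6}

    shared = sorted(set(human_scores) & set(model_scores))    # only jointly rated associations
    rho, p_value = spearmanr([human_scores[a] for a in shared],
                             [model_scores[a] for a in shared])
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")    # the paper reports rho = 0.62 over 1,500+ associations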

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Mitigation techniques may need to be developed separately for each language rather than applied uniformly across models.
  • Deploying the same model in a new linguistic region could increase stereotyping of groups that are locally visible.
  • The observed human-LLM rating alignment opens the possibility of using models themselves to monitor and flag emerging stereotypes during deployment.
  • The patterns may extend beyond story writing to other open-ended tasks such as dialogue or summarization.

Load-bearing premise

That over-representation of a socio-demographic profile in the generated stories reliably signals a harmful stereotype rather than an artifact of the story prompt or common narrative conventions.

What would settle it

Running the identical generation, annotation, and rating pipeline on a fresh set of models or languages and finding zero over-represented associations that both humans and LLMs rate as harmful would falsify the claim that every model emits consequential harmful stereotypes.

Figures

Figures reproduced from arXiv: 2605.10442 by Bazire Houssin, Benoît Malézieux, Étienne Duchesne, Matteo Dora, Pierre Le Jeune, Stefano Palminteri, Weixuan Xiao.

Figure 1
Figure 1: Overview of the methodology. Prompts are built by combining 19 demographic attributes with a catalogue of narrative scenarios; each fixes a single attribute value and is submitted to the LLM under test, which generates a short story. An ensemble of three LLM extractors then recovers the attribute profile from the story, and co-occurrences are aggregated into contingency tables tested at the attribute and v… view at source ↗
Figure 2
Figure 2: (A) Harmful and benign association counts generated by each model. (B) Overall benign/harmful distribution and harmful associations split by models' universality bins. (C) Examples of the top harmful and benign associations observed across models. view at source ↗
Figure 3
Figure 3: Language specificity of bias emission. (A) Per-model language reach. (B) Pairwise Jaccard overlap of associations (note: uk = Ukrainian) showing a possible West-European block (solid) and a weaker LATAM block (dashed). (C) Selected harmful examples, local (top) vs. globally shared (bottom); cells show the number of models emitting the association. (D) Per-language tests of the unmarked (top) and protected … view at source ↗
Figure 4
Figure 4: Comparison of model and human harmfulness judgments on the … view at source ↗
Figure 5
Figure 5: Consent form. Some parts are redacted to satisfy double-blind requirements. view at source ↗
Figure 6
Figure 6: Instructions (part 1). view at source ↗
Figure 7
Figure 7: Instructions (part 2, top half). view at source ↗
Figure 8
Figure 8: Instructions (part 2, bottom half). view at source ↗
Figure 9
Figure 9: Comprehension check. view at source ↗
Figure 10
Figure 10: Example trial: harmfulness and realism questions. view at source ↗
Figure 11
Figure 11: (A) Number of significant benign and harmful associations extracted for every pair of attribute categories, aggregated across the 23 models in our panel, and (B) only harmful associations. view at source ↗
Figure 12
Figure 12: Number of models producing an association for each pair of attribute values. Cells of … view at source ↗
Figure 13
Figure 13: Distribution of benign and harmful associations produced by each model and split by … view at source ↗
Figure 14
Figure 14: (top) Influence of model capabilities on the number of harmful associations produced, … view at source ↗
Figure 15
Figure 15: Harmful association counts for each model, grouped by provider. view at source ↗
Figure 16
Figure 16: Distribution of benign and harmful associations across the three prompt templates. view at source ↗
Figure 17
Figure 17: Distribution of benign and harmful associations across the three prompt templates for … view at source ↗
Figure 18
Figure 18: Per-model reproduction of Figure 3A. For each generator, distribution of the number of prompt languages in which a harmful (red) or benign (green) association is significantly emitted. The local-skewed shape with a secondary mode at full coverage holds across all 23 models. Interpretation: the bimodal shape is consistent with two qualitatively different classes of bias vocabulary co-existing in the models… view at source ↗
Figure 19
Figure 19: Distribution of effective language reach … view at source ↗
Figure 20
Figure 20: Average-linkage dendrogram on 1 − Jaccard distances between per-language harmful-association sets, with the West-European and Iberian nodes annotated by their bootstrap support over 1,000 resamples of the 236 harmful associations. view at source ↗
Figure 21
Figure 21: Per-language harmful output vs. log10 CommonCrawl share. Left: number of distinct harmful associations. Right: number of harmful (model, association) emissions. Spearman ρ and permutation p-value reported in each title. view at source ↗
Figure 22
Figure 22: Three views of harmfulness rating variability, comparing humans (blue) and the ensemble … view at source ↗
Figure 23
Figure 23: Mean ∆ = model − human harmfulness per (model, attribute) cell, grouped by model family. Columns are sorted by the global per-attribute ∆ from … view at source ↗
read the original abstract

Multilingual studies of social bias in open-ended LLM generation remain limited: most existing benchmarks are English-centric, template-based, or restricted to recognizing pre-specified stereotypes. We introduce StereoTales, a multilingual dataset and evaluation pipeline for systematically studying the emergence of social bias in open-ended LLM generation. The dataset covers 10 languages and 79 socio-demographic attributes, and comprises over 650k stories generated by 23 recent LLMs, each annotated with the socio-demographic profile of the protagonist across 19 dimensions. From these, we apply statistical tests to identify more than 1,500 over-represented associations, which we then rate for harmfulness through both a panel of humans (N = 247) and the same LLMs. We report three main findings. (i) Every model we evaluate emits consequential harmful stereotypes in open-ended generation, regardless of size or capabilities, and these associations are largely shared across providers rather than isolated misbehaviors. (ii) Prompt language strongly shapes which stereotypes appear: rather than transferring as a shared set of biases, harmful associations adapt culturally to the prompt language and amplify bias against locally salient protected groups. (iii) Human and LLM harmfulness judgments are broadly aligned (Spearman ρ = 0.62), with disagreements concentrating on specific attribute classes rather than specific providers. To support further analyses, we release the evaluation code and the dataset, including model generations, attribute annotations, and harmfulness ratings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces StereoTales, a multilingual dataset and evaluation pipeline for open-ended stereotype discovery in LLMs. It generates over 650k stories across 10 languages and 79 socio-demographic attributes using 23 recent models, annotates protagonists on 19 dimensions, applies statistical tests to flag more than 1,500 over-represented associations, and rates these for harmfulness via a human panel (N=247) and LLM judgments. The three main claims are that every evaluated model emits consequential harmful stereotypes regardless of size or provider, that prompt language shapes and culturally adapts these associations, and that human and LLM harm ratings align (Spearman ρ=0.62).

Significance. If the central claims hold after addressing methodological gaps, the work offers a scalable, open-ended, multilingual alternative to template-based or English-centric bias benchmarks. The dataset scale, statistical testing against baselines, human validation, and public release of generations, annotations, and ratings would be valuable contributions to bias measurement in LLMs.

major comments (1)
  1. [evaluation pipeline / statistical tests] The statistical tests used to identify the >1,500 over-represented associations (described in the evaluation pipeline) do not include a documented null model such as matched prompts with randomized or absent demographic cues, or permutation baselines that preserve prompt structure. Without this, the tests cannot reliably distinguish model-intrinsic stereotypes from distributional skews introduced by the explicit socio-demographic references in the generation prompts. This directly affects the load-bearing claims that every model emits harmful stereotypes 'regardless of size or capabilities' and that associations are 'largely shared across providers'.
minor comments (2)
  1. [Abstract] The abstract states that statistical tests were applied but does not specify exact significance thresholds or correction methods for multiple comparisons.
  2. [Dataset construction] The description of attribute annotation validation could be expanded to clarify inter-annotator agreement metrics and how the 19 dimensions were operationalized.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the single major comment on the evaluation pipeline below and will incorporate the suggested controls in the revised manuscript to strengthen the statistical claims.

read point-by-point responses
  1. Referee: The statistical tests used to identify the >1,500 over-represented associations (described in the evaluation pipeline) do not include a documented null model such as matched prompts with randomized or absent demographic cues, or permutation baselines that preserve prompt structure. Without this, the tests cannot reliably distinguish model-intrinsic stereotypes from distributional skews introduced by the explicit socio-demographic references in the generation prompts. This directly affects the load-bearing claims that every model emits harmful stereotypes 'regardless of size or capabilities' and that associations are 'largely shared across providers'.

    Authors: We agree that a documented null model is necessary to more rigorously isolate model-intrinsic effects from prompt-induced distributional skews. Our current pipeline applies chi-squared goodness-of-fit tests (with Bonferroni correction) comparing observed attribute co-occurrence frequencies in the generated stories against a uniform baseline expectation conditioned on the explicit socio-demographic cue in the prompt. While this already goes beyond raw frequency counts, it does not fully address potential phrasing artifacts. In the revision we will add: (1) permutation baselines that randomly reassign protagonist attribute labels to stories while exactly preserving prompt structure, generation length, and model sampling parameters; and (2) a matched control set of prompts that replace specific socio-demographic references with neutral phrasing (e.g., “a person” or randomized attribute values). We will re-compute the >1,500 flagged associations under these nulls, report the fraction that remain significant, and update Section 3 (Evaluation Pipeline) and the corresponding results in Section 4. These additions will directly support the claims that the stereotypes are model-driven and largely shared across providers rather than artifacts of prompt construction. revision: yes
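
To make the proposed controls concrete, a hedged sketch of both: a chi-squared goodness-of-fit test with Bonferroni correction against a pooled baseline, plus a label-permutation null that keeps per-prompt group sizes fixed. The row layout (prompt_attr, out_value), the pooled-rate baseline, and the thresholds are illustrative assumptions, not the authors' exact pipeline.

    # Sketch: flag over-represented (prompt attribute, protagonist value) pairs.
    import numpy as np
    from scipy.stats import chisquare

    def flag_associations(rows, alpha=0.05, n_perm=2000, seed=0):
        """rows: dicts with 'prompt_attr' (attribute fixed in the prompt) and
        'out_value' (protagonist attribute extracted from the story)."""
        rng = np.random.default_rng(seed)
        prompt = np.array([r["prompt_attr"] for r in rows])
        value = np.array([r["out_value"] for r in rows])
        tested = sorted({(pa, v) for pa, v in zip(prompt, value)})
        results = {}
        for pa, v in tested:
            mask = prompt == pa
            n, k = int(mask.sum()), int(np.sum(value[mask] == v))
            pooled_rate = float(np.mean(value == v))    # baseline expectation from all prompts
            if pooled_rate >= 1.0:                      # degenerate: value appears in every story
                continue
            # Chi-squared goodness of fit: observed [v, not-v] counts under this prompt
            # attribute vs. counts expected from the pooled rate.
            chi_p = chisquare([k, n - k],
                              f_exp=[n * pooled_rate, n * (1 - pooled_rate)]).pvalue
            # Permutation null: redraw n protagonist values from the pooled set,
            # keeping the prompt grouping (group size n) fixed.
            null = np.array([np.sum(rng.choice(value, size=n, replace=False) == v)
                             for _ in range(n_perm)])
            perm_p = (np.sum(null >= k) + 1) / (n_perm + 1)   # one-sided over-representation
            results[(pa, v)] = (chi_p, perm_p)
        bonf = alpha / max(len(tested), 1)              # Bonferroni over all tested pairs
        return {pair: ps for pair, ps in results.items()
                if ps[0] < bonf and ps[1] < alpha}

Under this sketch, an association survives only if it clears both the corrected chi-squared threshold and the permutation null, which is the separation between model-intrinsic skew and prompt-induced skew that the referee asks for.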

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline is self-contained

full rationale

The paper's central claims rest on generating 650k stories via explicit prompts, applying statistical tests to detect over-represented socio-demographic associations in the output data, and obtaining independent human (N=247) plus LLM harm ratings. These steps are data-driven measurements against statistical baselines rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or sections reduce the reported associations or harm findings to the prompting format by construction; the derivation chain remains open to external falsification via the released dataset and code. The absence of documented null-model details affects validity but does not create circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that over-represented associations in generated text constitute stereotypes and that harm ratings by humans and models are meaningful proxies. No new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • over-representation significance threshold
    Determines which of the tested associations are counted among the 1,500+ flagged stereotypes.
axioms (1)
  • domain assumption: LLM-generated stories reflect internal model associations that can be interpreted as stereotypes
    Invoked when treating statistical over-representation as evidence of bias emission.

pith-pipeline@v0.9.0 · 5605 in / 1256 out tokens · 44274 ms · 2026-05-13T07:15:16.696236+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

132 extracted references · 132 canonical work pages · 11 internal anchors
