pith. machine review for the scientific record.

arxiv: 2604.20677 · v2 · submitted 2026-04-22 · 💻 cs.CL

Recognition: unknown

Intersectional Fairness in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords intersectional fairness · large language models · stereotype alignment · subgroup fairness · bias evaluation · LLM consistency · demographic bias · ambiguous contexts

The pith

None of the evaluated large language models achieves consistently fair or reliable behavior across intersecting demographic groups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests six large language models on intersectional fairness using ambiguous and disambiguated contexts drawn from two benchmark datasets. It tracks bias scores, subgroup fairness, accuracy, and consistency over multiple runs while varying question polarity. Models appear competent in ambiguous settings, but because most answers there are the correct "unknown" option, few predictions are decisive and the fairness metrics carry little information. In disambiguated settings, accuracy rises when the correct answer matches a stereotype and falls when it contradicts one, with the effect strongest at race-gender intersections. Subgroup metrics show uneven outcome spreads even when raw disparity looks small, and repeated runs produce inconsistent answers that sometimes reinforce stereotypes.

Core claim

The central claim is that modern LLMs display stereotype-aligned behavior in intersectional settings: accuracy improves when the correct answer fits an existing stereotype (most strongly for race-gender pairs), subgroup fairness metrics reveal persistently uneven outcome distributions across groups, and responses fluctuate across repeated runs. No model maintains both high accuracy and even, consistent outcomes when demographic attributes intersect. The authors therefore conclude that competence on these tests is partly tied to stereotype-consistent cues and that no evaluated model reaches reliable fairness across intersectional conditions.

What carries the argument

The multi-metric evaluation protocol that compares bias scores, subgroup fairness, accuracy, and run-to-run consistency on ambiguous versus disambiguated contexts for intersecting demographic attributes.
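
A minimal sketch of how such a protocol might be wired together, assuming BBQ-style item records, a `query_model` stub, and five repeated runs; the field names and structure are assumptions, not the authors' implementation:

```python
from statistics import mean

# Hypothetical per-item records from a BBQ-style benchmark (field names assumed).
items = [
    {"id": 1, "context": "disambiguated", "gold": "A", "stereotype_aligned": True},
    {"id": 2, "context": "disambiguated", "gold": "B", "stereotype_aligned": False},
    # ... more items, including ambiguous ones whose gold answer is "unknown"
]

def query_model(item, run):
    """Placeholder for an actual LLM call; returns one answer option per run."""
    raise NotImplementedError

def evaluate(items, n_runs=5):
    # Collect every run's answer for every item.
    answers = {it["id"]: [query_model(it, r) for r in range(n_runs)] for it in items}

    def accuracy(subset):
        hits = [a == it["gold"] for it in subset for a in answers[it["id"]]]
        return mean(hits) if hits else float("nan")

    # Accuracy split by whether the correct answer reinforces the stereotype.
    disamb = [it for it in items if it["context"] == "disambiguated"]
    acc_aligned = accuracy([it for it in disamb if it["stereotype_aligned"]])
    acc_contra = accuracy([it for it in disamb if not it["stereotype_aligned"]])

    # Run-to-run consistency: fraction of items whose answer never changes.
    consistency = mean(len(set(runs)) == 1 for runs in answers.values())

    return {"acc_aligned": acc_aligned, "acc_contra": acc_contra,
            "alignment_gap": acc_aligned - acc_contra, "consistency": consistency}
```

The `alignment_gap` term is the quantity the core claim turns on: a positive value means the model is more accurate when the correct answer reinforces the stereotype.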

If this is right

  • Accuracy is higher when the correct answer aligns with a stereotype than when it contradicts one.
  • The alignment effect is strongest for race-gender intersections.
  • Subgroup fairness metrics can report low disparity while outcome distributions remain uneven across intersectional groups.
  • Responses vary in consistency across repeated runs and can include stereotype-aligned answers.
  • Fairness evaluation must combine bias scores, subgroup metrics, and consistency checks rather than rely on accuracy alone.
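
To make the subgroup-metrics point concrete: a single disparity number computed on one outcome can be zero while the full outcome distributions still differ sharply across intersectional subgroups. The sketch below is illustrative only; the group labels, outcomes, and metric choices (max-min selection-rate gap, total-variation distance) are assumptions, not the paper's definitions.

```python
from collections import Counter

# Invented outcomes per intersectional subgroup, purely for illustration.
outcomes = {
    ("Black", "woman"): ["neg", "unknown", "unknown", "unknown", "unknown"],
    ("Black", "man"):   ["neg", "unknown", "unknown", "pos", "unknown"],
    ("white", "woman"): ["neg", "pos", "pos", "unknown", "pos"],
    ("white", "man"):   ["neg", "pos", "pos", "pos", "pos"],
}
OPTIONS = ("neg", "pos", "unknown")

def dist(answers):
    """Empirical distribution of answers over the option set."""
    c = Counter(answers)
    return {o: c[o] / len(answers) for o in OPTIONS}

# Disparity on the "negative attribution" rate alone: identical across groups here.
rates = {g: dist(a)["neg"] for g, a in outcomes.items()}
disparity = max(rates.values()) - min(rates.values())

# Total-variation distance of each group's full outcome distribution from the pooled one.
pooled = dist([a for answers in outcomes.values() for a in answers])
tv = {}
for g, answers in outcomes.items():
    d = dist(answers)
    tv[g] = 0.5 * sum(abs(d[o] - pooled[o]) for o in OPTIONS)

print(f"max-min disparity on negative rate: {disparity:.2f}")  # 0.00 for this data
for g, distance in tv.items():
    print(g, f"TV distance from pooled: {distance:.2f}")        # clearly non-zero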

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models that depend on stereotype cues for higher accuracy may produce systematically skewed decisions in applications such as hiring screening or clinical summarization.
  • Requiring models to maintain consistency across runs would expose additional fairness failures not visible in single-pass tests.
  • New training objectives that penalize accuracy gains tied to stereotype alignment could be tested directly against the same ambiguous/disambiguated split.

Load-bearing premise

The selected benchmark datasets and fairness metrics capture the main real-world intersectional fairness problems that matter for deployed language models.

What would settle it

An experiment in which at least one LLM shows equal accuracy on stereotype-aligned and stereotype-contradicting answers across all tested intersections, produces even outcome distributions in every subgroup, and returns identical decisions on repeated runs for both ambiguous and disambiguated questions.
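
Framed as a hypothetical acceptance test (the `results` structure and the strict zero tolerance are assumptions; the paper states the criterion in prose, not as a harness):

```python
def settles_it(results, tol=0.0):
    """Pass/fail check for the three conditions above; the results layout is assumed."""
    # Equal accuracy on stereotype-aligned vs. stereotype-contradicting answers,
    # for every tested intersection.
    equal_accuracy = all(
        abs(r["acc_aligned"] - r["acc_contra"]) <= tol
        for r in results["per_intersection"].values()
    )
    # Even outcome distributions in every intersectional subgroup.
    even_outcomes = all(
        r["distance_from_even"] <= tol
        for r in results["per_subgroup"].values()
    )
    # Identical decisions on repeated runs, in both context types.
    fully_consistent = results["fraction_identical_across_runs"] >= 1.0 - tol
    return equal_accuracy and even_outcomes and fully_consistent
```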

Figures

Figures reproduced from arXiv: 2604.20677 by Ann Barcomb, Chaima Boufaied, Ronnie de Souza Santos.

Figure 2. Ambiguous Context (Negative Polarity): number of questions wrongly answered, by model (Gemma-3, Llama-3, claude-sonnet-4, gemini2.0flash, gemini2.5flash, gpt4o).
Figure 4. Ambiguous Context (Non-Negative Polarity).
original abstract

Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper evaluates intersectional fairness across six LLMs on ambiguous and disambiguated contexts drawn from two benchmark datasets. It reports results on bias scores, subgroup fairness, accuracy, and multi-run consistency (including negative/non-negative polarities), concluding that modern LLMs show competence in ambiguous settings that limits metric informativeness, that accuracy tracks stereotype alignment (especially for race-gender intersections), and that no evaluated model exhibits consistently reliable or fair behavior across intersectional groups.

Significance. If the central empirical patterns hold, the work usefully demonstrates that accuracy alone is an insufficient proxy for intersectional fairness and that combining bias, subgroup, and consistency metrics across contexts and repeated runs reveals persistent limitations. The multi-metric, multi-run design is a methodological strength that could support more robust future evaluations.

major comments (3)
  1. [Abstract] Abstract: the headline claim that 'no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings' is load-bearing yet rests on the untested assumption that the two (unnamed) benchmark datasets and the composite of bias scores, subgroup fairness, accuracy, and consistency are sufficient proxies. The abstract itself notes that ambiguous contexts produce 'sparse non-unknown predictions' that limit metric informativeness, directly weakening support for the universal negative conclusion.
  2. [Abstract and §3] Abstract and §3 (Datasets and Metrics): the observation that accuracy is higher when the correct answer aligns with a stereotype (especially race-gender) could be an artifact of benchmark construction if the 'correct answer' labels or context distributions embed the same stereotypes being measured. No evidence is provided that the metrics remain stable under prompt variation or dataset expansion, which is required to treat the pattern as a general model property rather than a dataset-specific effect.
  3. [§4] §4 (Results): the subgroup fairness metrics are reported to show 'low observed disparity in some cases' yet 'uneven outcome distributions.' Without the exact definitions of the subgroup fairness metrics, the statistical tests used, or controls for multiple comparisons across intersectional pairs, it is unclear whether the reported unevenness is statistically reliable or merely descriptive.
minor comments (2)
  1. [Abstract] The abstract refers to 'negative and non-negative question polarities' without a concise definition or example; adding one sentence of clarification would improve readability.
  2. [§4] Table or figure captions for the multi-run consistency results should explicitly state the number of runs and the exact consistency metric (e.g., majority vote, entropy) to aid reproducibility.
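
For the second point, two candidate consistency measures of the kind named there, computed per item across repeated runs; these definitions are illustrative and not necessarily the ones the paper uses.

```python
import math
from collections import Counter

def consistency_metrics(run_answers):
    """Majority-vote agreement and answer entropy for one item's repeated-run answers."""
    counts = Counter(run_answers)
    n = len(run_answers)
    majority_agreement = counts.most_common(1)[0][1] / n   # 1.0 means all runs agree
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)         # 0.0 means all runs agree
    return majority_agreement, entropy

# Example: five runs on one question, with a single stereotype-aligned flip.
print(consistency_metrics(["unknown", "unknown", "A", "unknown", "unknown"]))
# approximately (0.8, 0.72)
```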

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below with point-by-point responses, indicating where we will make revisions to improve clarity, rigor, and transparency.

point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that 'no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings' is load-bearing yet rests on the untested assumption that the two (unnamed) benchmark datasets and the composite of bias scores, subgroup fairness, accuracy, and consistency are sufficient proxies. The abstract itself notes that ambiguous contexts produce 'sparse non-unknown predictions' that limit metric informativeness, directly weakening support for the universal negative conclusion.

    Authors: We appreciate the referee's point regarding the scope of our headline claim. The conclusion is drawn from results across the two benchmark datasets and the multi-metric framework (bias scores, subgroup fairness, accuracy, and multi-run consistency) described in the paper. We explicitly note the limitations of ambiguous contexts in the abstract and focus the strongest claims on disambiguated contexts where predictions are more informative. To address the concern about framing, we will revise the abstract to qualify the claim as applying to the evaluated models on these specific benchmarks and metrics, rather than as a universal statement across all possible datasets or settings. revision: yes

  2. Referee: [Abstract and §3] Abstract and §3 (Datasets and Metrics): the observation that accuracy is higher when the correct answer aligns with a stereotype (especially race-gender) could be an artifact of benchmark construction if the 'correct answer' labels or context distributions embed the same stereotypes being measured. No evidence is provided that the metrics remain stable under prompt variation or dataset expansion, which is required to treat the pattern as a general model property rather than a dataset-specific effect.

    Authors: We acknowledge that the observed accuracy-stereotype alignment pattern could be influenced by the specific construction of the benchmarks, and that broader validation would be valuable. The datasets are established fairness benchmarks with ground-truth answers derived from the provided contexts. Our multi-run analysis across negative and non-negative polarities provides evidence of consistency within the current setup, but we did not perform systematic prompt variations or test on expanded datasets. We will revise §3 to include more detail on dataset construction and add a limitations discussion noting the need for future robustness checks under prompt variation and dataset expansion. This will clarify the scope of the current findings. revision: partial

  3. Referee: [§4] §4 (Results): the subgroup fairness metrics are reported to show 'low observed disparity in some cases' yet 'uneven outcome distributions.' Without the exact definitions of the subgroup fairness metrics, the statistical tests used, or controls for multiple comparisons across intersectional pairs, it is unclear whether the reported unevenness is statistically reliable or merely descriptive.

    Authors: We thank the referee for identifying this gap in presentation. We will revise §4 to include the exact mathematical definitions and formulas for all subgroup fairness metrics, specify the statistical tests used to evaluate disparities (including significance levels), and apply appropriate corrections for multiple comparisons across intersectional pairs. These additions will allow readers to assess whether the uneven outcome distributions reflect statistically reliable patterns. revision: yes
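
One plausible shape for the promised analysis, offered as a hedged sketch: a chi-square test on each intersection's outcome contingency table with a Bonferroni correction across intersections. The counts below are invented for illustration, and the revised paper may use different tests or corrections.

```python
from scipy.stats import chi2_contingency

# Illustrative contingency tables (rows: subgroups, columns: outcome counts); invented data.
tables = {
    "race x gender": [[12, 30, 58], [25, 20, 55]],
    "race x age":    [[18, 28, 54], [20, 26, 54]],
    "gender x age":  [[15, 25, 60], [22, 24, 54]],
}

alpha = 0.05
m = len(tables)  # number of intersections tested
for name, table in tables.items():
    chi2, p, dof, _ = chi2_contingency(table)
    significant = p < alpha / m  # Bonferroni correction across the m intersections
    print(f"{name}: chi2={chi2:.2f}, p={p:.4f}, significant after correction: {significant}")
```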

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or self-referential reductions

full rationale

The paper reports results from running six LLMs on two public benchmark datasets, measuring bias scores, subgroup fairness, accuracy, and consistency across contexts and polarities. No equations, fitted parameters, or derivation steps are present that could reduce outputs to inputs by construction. The central claim follows directly from the observed empirical patterns rather than from any self-definition, ansatz, or self-citation load-bearing step. Self-citations, if any, are not required to support the evaluation methodology itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard fairness metrics and existing benchmark datasets without introducing new fitted parameters, axioms beyond domain conventions, or invented entities.

axioms (1)
  • domain assumption: Standard subgroup fairness metrics (e.g., demographic parity or equalized odds) are appropriate for evaluating LLM output distributions across intersectional groups.
    Invoked when reporting subgroup fairness metrics and outcome distributions.
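
For concreteness, minimal sketches of the two metrics named in this assumption, applied to intersectional groups; the group keys, binary encoding, and zero-division guards are illustrative assumptions rather than the paper's formulations.

```python
def demographic_parity_gap(groups):
    """groups: {group: list of 0/1 predictions}; max-min gap in positive-prediction rate."""
    rates = [sum(preds) / len(preds) for preds in groups.values()]
    return max(rates) - min(rates)

def equalized_odds_gap(groups):
    """groups: {group: (y_true, y_pred)}; largest gap in TPR or FPR across groups."""
    def tpr_fpr(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        pos = sum(y_true) or 1                     # guard against empty classes
        neg = (len(y_true) - sum(y_true)) or 1
        return tp / pos, fp / neg
    tprs, fprs = zip(*(tpr_fpr(t, p) for t, p in groups.values()))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Tiny invented example for the parity gap.
print(demographic_parity_gap({("Black", "woman"): [1, 0, 0],
                              ("white", "man"):   [1, 1, 0]}))  # approximately 0.33
```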

pith-pipeline@v0.9.0 · 5546 in / 1239 out tokens · 35287 ms · 2026-05-09T23:38:34.300552+00:00 · methodology

discussion (0)

