Recognition: unknown
Intersectional Fairness in Large Language Models
Pith reviewed 2026-05-09 23:38 UTC · model grok-4.3
The pith
No evaluated large language model achieves consistently fair or reliable behavior across intersecting demographic groups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that modern LLMs display stereotype-aligned behavior in intersectional settings: accuracy improves when the right answer fits existing stereotypes, especially for race-gender pairs; subgroup fairness metrics reveal persistently uneven distributions across groups; and responses fluctuate across repeated runs. No model maintains both high accuracy and even, consistent outcomes when demographic attributes intersect. The authors therefore conclude that competence on these tests is partly tied to stereotype-consistent cues and that no evaluated model reaches reliable fairness across intersectional conditions.
What carries the argument
The multi-metric evaluation protocol that compares bias scores, subgroup fairness, accuracy, and run-to-run consistency on ambiguous versus disambiguated contexts for intersecting demographic attributes.
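A minimal sketch of what such a protocol loop could look like, assuming BBQ-style multiple-choice items with an ambiguous/disambiguated context flag, a question polarity, an "unknown" option, and a stereotype-aligned target. The `Item` fields, the `query_model` callable, and the record keys are illustrative assumptions, not the paper's actual code or data schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    context: str        # "ambiguous" or "disambiguated"
    polarity: str       # "negative" or "non-negative"
    question: str
    options: list[str]  # two demographic options plus an "unknown" option
    gold: str           # correct answer for this item
    stereotyped: str    # option a stereotype would point to
    subgroup: str       # intersectional label, e.g. a race-gender pair

def evaluate(items: list[Item],
             query_model: Callable[[str], str],
             n_runs: int = 5) -> list[dict]:
    """Collect one record per (item, run): everything the downstream
    accuracy, bias, subgroup-fairness, and consistency metrics need."""
    records = []
    for item in items:
        prompt = f"{item.question}\nOptions: {', '.join(item.options)}"
        for run in range(n_runs):
            answer = query_model(prompt)
            records.append({
                "context": item.context,
                "polarity": item.polarity,
                "subgroup": item.subgroup,
                "run": run,
                "answer": answer,
                "correct": answer == item.gold,
                "gold_is_stereotyped": item.gold == item.stereotyped,
                "answered_stereotype": answer == item.stereotyped,
            })
    return records
```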
If this is right
- Accuracy is higher when the correct answer aligns with a stereotype than when it contradicts one (quantified in the sketch after this list).
- The alignment effect is strongest for race-gender intersections.
- Subgroup fairness metrics can report low disparity while outcome distributions remain uneven across intersectional groups.
- Responses vary in consistency across repeated runs and can include stereotype-aligned answers.
- Fairness evaluation must combine bias scores, subgroup metrics, and consistency checks rather than rely on accuracy alone.
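The first bullet can be quantified directly from evaluation records like those sketched above; the field names here follow that sketch and are assumptions rather than the paper's schema.

```python
def alignment_accuracy_gap(records: list[dict]) -> float:
    """Accuracy when the gold answer matches the stereotype minus accuracy
    when it contradicts it, restricted to disambiguated contexts."""
    disamb = [r for r in records if r["context"] == "disambiguated"]
    aligned = [r["correct"] for r in disamb if r["gold_is_stereotyped"]]
    contra = [r["correct"] for r in disamb if not r["gold_is_stereotyped"]]
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(aligned) - mean(contra)  # > 0: stereotype-aligned advantage
```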
Where Pith is reading between the lines
- Models that depend on stereotype cues for higher accuracy may produce systematically skewed decisions in applications such as hiring screening or clinical summarization.
- Requiring models to maintain consistency across runs would expose additional fairness failures not visible in single-pass tests.
- New training objectives that penalize accuracy gains tied to stereotype alignment could be tested directly against the same ambiguous/disambiguated split.
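As one hedged sketch of what such an objective could look like (our construction, not anything proposed in the paper), a fine-tuning loss could penalize the model whenever stereotype-aligned examples are fit more easily than stereotype-contradicting ones.

```python
import torch

def stereotype_gap_penalty(loss_aligned: torch.Tensor,
                           loss_contra: torch.Tensor,
                           lam: float = 0.1) -> torch.Tensor:
    """Penalty grows when the batch of stereotype-contradicting examples
    carries higher loss than the stereotype-aligned batch, i.e. when
    apparent competence leans on stereotype-consistent cues."""
    gap = torch.relu(loss_contra.mean() - loss_aligned.mean())
    return lam * gap

# total_loss = task_loss + stereotype_gap_penalty(loss_aligned, loss_contra)
```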
Load-bearing premise
The selected benchmark datasets and fairness metrics capture the main real-world intersectional fairness problems that matter for deployed language models.
What would settle it
An experiment in which at least one LLM shows equal accuracy on stereotype-aligned and stereotype-contradicting answers across all tested intersections, produces even outcome distributions in every subgroup, and returns identical decisions on repeated runs for both ambiguous and disambiguated questions.
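Read as a concrete pass/fail rule, that experiment reduces to three summary statistics computed per model; a minimal sketch, assuming those statistics have already been obtained from the evaluation records. Strict equality mirrors the wording above; in practice one would presumably substitute tolerances.

```python
def settles_it(alignment_gap: float,
               max_subgroup_unevenness: float,
               run_agreement: float) -> bool:
    """True only if all three conditions hold: no stereotype-alignment
    accuracy gap, perfectly even subgroup outcome distributions, and
    identical answers across repeated runs."""
    return (alignment_gap == 0.0
            and max_subgroup_unevenness == 0.0
            and run_agreement == 1.0)
```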
read the original abstract
Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates intersectional fairness across six LLMs on ambiguous and disambiguated contexts drawn from two benchmark datasets. It reports results on bias scores, subgroup fairness, accuracy, and multi-run consistency (including negative/non-negative polarities), concluding that modern LLMs perform well in ambiguous settings in a way that limits the informativeness of fairness metrics, that accuracy tracks stereotype alignment (especially for race-gender intersections), and that no evaluated model exhibits consistently reliable or fair behavior across intersectional groups.
Significance. If the central empirical patterns hold, the work usefully demonstrates that accuracy alone is an insufficient proxy for intersectional fairness and that combining bias, subgroup, and consistency metrics across contexts and repeated runs reveals persistent limitations. The multi-metric, multi-run design is a methodological strength that could support more robust future evaluations.
major comments (3)
- [Abstract] The headline claim that 'no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings' is load-bearing yet rests on the untested assumption that the two (unnamed) benchmark datasets and the composite of bias scores, subgroup fairness, accuracy, and consistency are sufficient proxies. The abstract itself notes that ambiguous contexts produce 'sparse non-unknown predictions' that limit metric informativeness, directly weakening support for the universal negative conclusion.
- [Abstract, §3 Datasets and Metrics] The observation that accuracy is higher when the correct answer aligns with a stereotype (especially race-gender) could be an artifact of benchmark construction if the 'correct answer' labels or context distributions embed the same stereotypes being measured. No evidence is provided that the metrics remain stable under prompt variation or dataset expansion, which is required to treat the pattern as a general model property rather than a dataset-specific effect.
- [§4 Results] The subgroup fairness metrics are reported to show 'low observed disparity in some cases' yet 'uneven outcome distributions.' Without the exact definitions of the subgroup fairness metrics, the statistical tests used, or controls for multiple comparisons across intersectional pairs, it is unclear whether the reported unevenness is statistically reliable or merely descriptive.
minor comments (2)
- [Abstract] The abstract refers to 'negative and non-negative question polarities' without a concise definition or example; adding one sentence of clarification would improve readability.
- [§4] Table or figure captions for the multi-run consistency results should explicitly state the number of runs and the exact consistency metric (e.g., majority vote, entropy) to aid reproducibility.
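A minimal sketch of the two consistency metrics this comment names, assuming answers are grouped per question across repeated runs; the grouping and run count are assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def consistency(answers: list[str]) -> tuple[float, float]:
    """For one question's answers across repeated runs, return
    (majority-vote agreement rate, Shannon entropy in bits)."""
    counts = Counter(answers)
    n = len(answers)
    majority_rate = max(counts.values()) / n
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return majority_rate, entropy

# e.g. 5 runs on one item: 4 stereotype-aligned answers, 1 "unknown"
print(consistency(["man", "man", "man", "unknown", "man"]))  # (0.8, ~0.72)
```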
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below with point-by-point responses, indicating where we will make revisions to improve clarity, rigor, and transparency.
read point-by-point responses
-
Referee: [Abstract] The headline claim that 'no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings' is load-bearing yet rests on the untested assumption that the two (unnamed) benchmark datasets and the composite of bias scores, subgroup fairness, accuracy, and consistency are sufficient proxies. The abstract itself notes that ambiguous contexts produce 'sparse non-unknown predictions' that limit metric informativeness, directly weakening support for the universal negative conclusion.
Authors: We appreciate the referee's point regarding the scope of our headline claim. The conclusion is drawn from results across the two benchmark datasets and the multi-metric framework (bias scores, subgroup fairness, accuracy, and multi-run consistency) described in the paper. We explicitly note the limitations of ambiguous contexts in the abstract and focus the strongest claims on disambiguated contexts where predictions are more informative. To address the concern about framing, we will revise the abstract to qualify the claim as applying to the evaluated models on these specific benchmarks and metrics, rather than as a universal statement across all possible datasets or settings. revision: yes
-
Referee: [Abstract, §3 Datasets and Metrics] The observation that accuracy is higher when the correct answer aligns with a stereotype (especially race-gender) could be an artifact of benchmark construction if the 'correct answer' labels or context distributions embed the same stereotypes being measured. No evidence is provided that the metrics remain stable under prompt variation or dataset expansion, which is required to treat the pattern as a general model property rather than a dataset-specific effect.
Authors: We acknowledge that the observed accuracy-stereotype alignment pattern could be influenced by the specific construction of the benchmarks, and that broader validation would be valuable. The datasets are established fairness benchmarks with ground-truth answers derived from the provided contexts. Our multi-run analysis across negative and non-negative polarities provides evidence of consistency within the current setup, but we did not perform systematic prompt variations or test on expanded datasets. We will revise §3 to include more detail on dataset construction and add a limitations discussion noting the need for future robustness checks under prompt variation and dataset expansion. This will clarify the scope of the current findings. revision: partial
-
Referee: [§4 Results] The subgroup fairness metrics are reported to show 'low observed disparity in some cases' yet 'uneven outcome distributions.' Without the exact definitions of the subgroup fairness metrics, the statistical tests used, or controls for multiple comparisons across intersectional pairs, it is unclear whether the reported unevenness is statistically reliable or merely descriptive.
Authors: We thank the referee for identifying this gap in presentation. We will revise §4 to include the exact mathematical definitions and formulas for all subgroup fairness metrics, specify the statistical tests used to evaluate disparities (including significance levels), and apply appropriate corrections for multiple comparisons across intersectional pairs. These additions will allow readers to assess whether the uneven outcome distributions reflect statistically reliable patterns. revision: yes
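A sketch of the kind of analysis the promised revision could report, assuming per-subgroup counts of how often each answer type was returned; the pairwise chi-square test and Holm correction shown here are illustrative choices, not necessarily the procedure the authors will adopt.

```python
from scipy.stats import chi2_contingency

def subgroup_disparity_tests(counts: dict[str, dict[str, int]],
                             alpha: float = 0.05):
    """counts[subgroup][answer_type] -> frequency.
    Pairwise chi-square tests of answer distributions between
    intersectional subgroups, with a Holm step-down correction."""
    groups = sorted(counts)
    answers = sorted({a for g in groups for a in counts[g]})
    pvals = []
    for i, g1 in enumerate(groups):
        for g2 in groups[i + 1:]:
            table = [[counts[g].get(a, 0) for a in answers] for g in (g1, g2)]
            _, p, _, _ = chi2_contingency(table)
            pvals.append(((g1, g2), p))
    # Holm correction across all pairwise comparisons
    pvals.sort(key=lambda x: x[1])
    m = len(pvals)
    results, running_max = [], 0.0
    for rank, (pair, p) in enumerate(pvals):
        adjusted = min(1.0, (m - rank) * p)
        running_max = max(running_max, adjusted)  # enforce monotonicity
        results.append((pair, p, running_max, running_max < alpha))
    return results
```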
Circularity Check
No circularity: purely empirical evaluation with no derivations or self-referential reductions
full rationale
The paper reports results from running six LLMs on two public benchmark datasets, measuring bias scores, subgroup fairness, accuracy, and consistency across contexts and polarities. No equations, fitted parameters, or derivation steps are present that could reduce outputs to inputs by construction. The central claim follows directly from the observed empirical patterns rather than from any self-definition, ansatz, or self-citation load-bearing step. Self-citations, if any, are not required to support the evaluation methodology itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard subgroup fairness metrics (e.g., demographic parity or equalized odds) are appropriate for evaluating LLM output distributions across intersectional groups.
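A compact sketch of the two metrics this assumption names, written for binary outcomes over intersectional subgroups; the variable names and the reduction to a single max-gap summary are our own illustrative choices. It assumes every group contains both positive and negative gold labels.

```python
import numpy as np

def demographic_parity_gap(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Largest difference in positive-prediction rate between any two groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return float(max(rates) - min(rates))

def equalized_odds_gap(y_true: np.ndarray, y_pred: np.ndarray,
                       group: np.ndarray) -> float:
    """Largest gap in true-positive or false-positive rate across groups."""
    tprs, fprs = [], []
    for g in np.unique(group):
        m = group == g
        tprs.append(y_pred[m & (y_true == 1)].mean())
        fprs.append(y_pred[m & (y_true == 0)].mean())
    return float(max(max(tprs) - min(tprs), max(fprs) - min(fprs)))

# toy intersectional labels (e.g. race x gender pairs)
group = np.array(["BW", "BW", "WM", "WM", "BM", "BM"])
y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0])
print(demographic_parity_gap(y_pred, group))          # 1.0
print(equalized_odds_gap(y_true, y_pred, group))      # 1.0
```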