Recognition: unknown
Intersectional Fairness in Large Language Models
Pith reviewed 2026-05-09 23:38 UTC · model grok-4.3
The pith
No evaluated large language model achieves consistently fair or reliable behavior across intersecting demographic groups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that modern LLMs display stereotype-aligned behavior in intersectional settings: accuracy improves when the right answer fits existing stereotypes, especially for race-gender pairs; subgroup fairness metrics reveal persistently uneven distributions across groups; and responses fluctuate across repeated runs. No model maintains both high accuracy and even, consistent outcomes when demographic attributes intersect. The authors therefore conclude that competence on these tests is partly tied to stereotype-consistent cues and that no evaluated model reaches reliable fairness across intersectional conditions.
What carries the argument
The multi-metric evaluation protocol that compares bias scores, subgroup fairness, accuracy, and run-to-run consistency on ambiguous versus disambiguated contexts for intersecting demographic attributes.
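A minimal sketch of what such a protocol loop could look like, assuming BBQ-style multiple-choice items with an ambiguous/disambiguated context flag, a question polarity, an "unknown" option, and a stereotype-aligned target. The `Item` fields, the `query_model` callable, and the record keys are illustrative assumptions, not the paper's actual code or data schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    context: str        # "ambiguous" or "disambiguated"
    polarity: str       # "negative" or "non-negative"
    question: str
    options: list[str]  # two demographic options plus an "unknown" option
    gold: str           # correct answer for this item
    stereotyped: str    # option a stereotype would point to
    subgroup: str       # intersectional label, e.g. a race-gender pair

def evaluate(items: list[Item],
             query_model: Callable[[str], str],
             n_runs: int = 5) -> list[dict]:
    """Collect one record per (item, run): everything the downstream
    accuracy, bias, subgroup-fairness, and consistency metrics need."""
    records = []
    for item in items:
        prompt = f"{item.question}\nOptions: {', '.join(item.options)}"
        for run in range(n_runs):
            answer = query_model(prompt)
            records.append({
                "context": item.context,
                "polarity": item.polarity,
                "subgroup": item.subgroup,
                "run": run,
                "answer": answer,
                "correct": answer == item.gold,
                "gold_is_stereotyped": item.gold == item.stereotyped,
                "answered_stereotype": answer == item.stereotyped,
            })
    return records
```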
If this is right
- Accuracy is higher when the correct answer aligns with a stereotype than when it contradicts one (quantified in the sketch after this list).
- The alignment effect is strongest for race-gender intersections.
- Subgroup fairness metrics can report low disparity while outcome distributions remain uneven across intersectional groups.
- Responses vary in consistency across repeated runs and can include stereotype-aligned answers.
- Fairness evaluation must combine bias scores, subgroup metrics, and consistency checks rather than rely on accuracy alone.
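The first bullet can be quantified directly from evaluation records like those sketched above; the field names here follow that sketch and are assumptions rather than the paper's schema.

```python
def alignment_accuracy_gap(records: list[dict]) -> float:
    """Accuracy when the gold answer matches the stereotype minus accuracy
    when it contradicts it, restricted to disambiguated contexts."""
    disamb = [r for r in records if r["context"] == "disambiguated"]
    aligned = [r["correct"] for r in disamb if r["gold_is_stereotyped"]]
    contra = [r["correct"] for r in disamb if not r["gold_is_stereotyped"]]
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(aligned) - mean(contra)  # > 0: stereotype-aligned advantage
```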
Where Pith is reading between the lines
- Models that depend on stereotype cues for higher accuracy may produce systematically skewed decisions in applications such as hiring screening or clinical summarization.
- Requiring models to maintain consistency across runs would expose additional fairness failures not visible in single-pass tests.
- New training objectives that penalize accuracy gains tied to stereotype alignment could be tested directly against the same ambiguous/disambiguated split.
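As one hedged sketch of what such an objective could look like (our construction, not anything proposed in the paper), a fine-tuning loss could penalize the model whenever stereotype-aligned examples are fit more easily than stereotype-contradicting ones.

```python
import torch

def stereotype_gap_penalty(loss_aligned: torch.Tensor,
                           loss_contra: torch.Tensor,
                           lam: float = 0.1) -> torch.Tensor:
    """Penalty grows when the batch of stereotype-contradicting examples
    carries higher loss than the stereotype-aligned batch, i.e. when
    apparent competence leans on stereotype-consistent cues."""
    gap = torch.relu(loss_contra.mean() - loss_aligned.mean())
    return lam * gap

# total_loss = task_loss + stereotype_gap_penalty(loss_aligned, loss_contra)
```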
Load-bearing premise
The selected benchmark datasets and fairness metrics capture the main real-world intersectional fairness problems that matter for deployed language models.
What would settle it
An experiment in which at least one LLM shows equal accuracy on stereotype-aligned and stereotype-contradicting answers across all tested intersections, produces even outcome distributions in every subgroup, and returns identical decisions on repeated runs for both ambiguous and disambiguated questions.
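Read as a concrete pass/fail rule, that experiment reduces to three summary statistics computed per model; a minimal sketch, assuming those statistics have already been obtained from the evaluation records. Strict equality mirrors the wording above; in practice one would presumably substitute tolerances.

```python
def settles_it(alignment_gap: float,
               max_subgroup_unevenness: float,
               run_agreement: float) -> bool:
    """True only if all three conditions hold: no stereotype-alignment
    accuracy gap, perfectly even subgroup outcome distributions, and
    identical answers across repeated runs."""
    return (alignment_gap == 0.0
            and max_subgroup_unevenness == 0.0
            and run_agreement == 1.0)
```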
read the original abstract
Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates intersectional fairness across six LLMs on ambiguous and disambiguated contexts drawn from two benchmark datasets. It reports results on bias scores, subgroup fairness, accuracy, and multi-run consistency (including negative/non-negative polarities), concluding that modern LLMs perform well in ambiguous settings in a way that limits the informativeness of fairness metrics, that accuracy tracks stereotype alignment (especially for race-gender intersections), and that no evaluated model exhibits consistently reliable or fair behavior across intersectional groups.
Significance. If the central empirical patterns hold, the work usefully demonstrates that accuracy alone is an insufficient proxy for intersectional fairness and that combining bias, subgroup, and consistency metrics across contexts and repeated runs reveals persistent limitations. The multi-metric, multi-run design is a methodological strength that could support more robust future evaluations.
major comments (3)
- [Abstract] The headline claim that 'no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings' is load-bearing yet rests on the untested assumption that the two (unnamed) benchmark datasets and the composite of bias scores, subgroup fairness, accuracy, and consistency are sufficient proxies. The abstract itself notes that ambiguous contexts produce 'sparse non-unknown predictions' that limit metric informativeness, directly weakening support for the universal negative conclusion.
- [Abstract, §3 Datasets and Metrics] The observation that accuracy is higher when the correct answer aligns with a stereotype (especially race-gender) could be an artifact of benchmark construction if the 'correct answer' labels or context distributions embed the same stereotypes being measured. No evidence is provided that the metrics remain stable under prompt variation or dataset expansion, which is required to treat the pattern as a general model property rather than a dataset-specific effect.
- [§4 Results] The subgroup fairness metrics are reported to show 'low observed disparity in some cases' yet 'uneven outcome distributions.' Without the exact definitions of the subgroup fairness metrics, the statistical tests used, or controls for multiple comparisons across intersectional pairs, it is unclear whether the reported unevenness is statistically reliable or merely descriptive.
minor comments (2)
- [Abstract] The abstract refers to 'negative and non-negative question polarities' without a concise definition or example; adding one sentence of clarification would improve readability.
- [§4] Table or figure captions for the multi-run consistency results should explicitly state the number of runs and the exact consistency metric (e.g., majority vote, entropy) to aid reproducibility.
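A minimal sketch of the two consistency metrics this comment names, assuming answers are grouped per question across repeated runs; the grouping and run count are assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def consistency(answers: list[str]) -> tuple[float, float]:
    """For one question's answers across repeated runs, return
    (majority-vote agreement rate, Shannon entropy in bits)."""
    counts = Counter(answers)
    n = len(answers)
    majority_rate = max(counts.values()) / n
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return majority_rate, entropy

# e.g. 5 runs on one item: 4 stereotype-aligned answers, 1 "unknown"
print(consistency(["man", "man", "man", "unknown", "man"]))  # (0.8, ~0.72)
```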
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below with point-by-point responses, indicating where we will make revisions to improve clarity, rigor, and transparency.
read point-by-point responses
-
Referee: [Abstract] The headline claim that 'no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings' is load-bearing yet rests on the untested assumption that the two (unnamed) benchmark datasets and the composite of bias scores, subgroup fairness, accuracy, and consistency are sufficient proxies. The abstract itself notes that ambiguous contexts produce 'sparse non-unknown predictions' that limit metric informativeness, directly weakening support for the universal negative conclusion.
Authors: We appreciate the referee's point regarding the scope of our headline claim. The conclusion is drawn from results across the two benchmark datasets and the multi-metric framework (bias scores, subgroup fairness, accuracy, and multi-run consistency) described in the paper. We explicitly note the limitations of ambiguous contexts in the abstract and focus the strongest claims on disambiguated contexts where predictions are more informative. To address the concern about framing, we will revise the abstract to qualify the claim as applying to the evaluated models on these specific benchmarks and metrics, rather than as a universal statement across all possible datasets or settings. revision: yes
-
Referee: [Abstract, §3 Datasets and Metrics] The observation that accuracy is higher when the correct answer aligns with a stereotype (especially race-gender) could be an artifact of benchmark construction if the 'correct answer' labels or context distributions embed the same stereotypes being measured. No evidence is provided that the metrics remain stable under prompt variation or dataset expansion, which is required to treat the pattern as a general model property rather than a dataset-specific effect.
Authors: We acknowledge that the observed accuracy-stereotype alignment pattern could be influenced by the specific construction of the benchmarks, and that broader validation would be valuable. The datasets are established fairness benchmarks with ground-truth answers derived from the provided contexts. Our multi-run analysis across negative and non-negative polarities provides evidence of consistency within the current setup, but we did not perform systematic prompt variations or test on expanded datasets. We will revise §3 to include more detail on dataset construction and add a limitations discussion noting the need for future robustness checks under prompt variation and dataset expansion. This will clarify the scope of the current findings. revision: partial
-
Referee: [§4 Results] The subgroup fairness metrics are reported to show 'low observed disparity in some cases' yet 'uneven outcome distributions.' Without the exact definitions of the subgroup fairness metrics, the statistical tests used, or controls for multiple comparisons across intersectional pairs, it is unclear whether the reported unevenness is statistically reliable or merely descriptive.
Authors: We thank the referee for identifying this gap in presentation. We will revise §4 to include the exact mathematical definitions and formulas for all subgroup fairness metrics, specify the statistical tests used to evaluate disparities (including significance levels), and apply appropriate corrections for multiple comparisons across intersectional pairs. These additions will allow readers to assess whether the uneven outcome distributions reflect statistically reliable patterns. revision: yes
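A sketch of the kind of analysis the promised revision could report, assuming per-subgroup counts of how often each answer type was returned; the pairwise chi-square test and Holm correction shown here are illustrative choices, not necessarily the procedure the authors will adopt.

```python
from scipy.stats import chi2_contingency

def subgroup_disparity_tests(counts: dict[str, dict[str, int]],
                             alpha: float = 0.05):
    """counts[subgroup][answer_type] -> frequency.
    Pairwise chi-square tests of answer distributions between
    intersectional subgroups, with a Holm step-down correction."""
    groups = sorted(counts)
    answers = sorted({a for g in groups for a in counts[g]})
    pvals = []
    for i, g1 in enumerate(groups):
        for g2 in groups[i + 1:]:
            table = [[counts[g].get(a, 0) for a in answers] for g in (g1, g2)]
            _, p, _, _ = chi2_contingency(table)
            pvals.append(((g1, g2), p))
    # Holm correction across all pairwise comparisons
    pvals.sort(key=lambda x: x[1])
    m = len(pvals)
    results, running_max = [], 0.0
    for rank, (pair, p) in enumerate(pvals):
        adjusted = min(1.0, (m - rank) * p)
        running_max = max(running_max, adjusted)  # enforce monotonicity
        results.append((pair, p, running_max, running_max < alpha))
    return results
```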
Circularity Check
No circularity: purely empirical evaluation with no derivations or self-referential reductions
full rationale
The paper reports results from running six LLMs on two public benchmark datasets, measuring bias scores, subgroup fairness, accuracy, and consistency across contexts and polarities. No equations, fitted parameters, or derivation steps are present that could reduce outputs to inputs by construction. The central claim follows directly from the observed empirical patterns rather than from any self-definition, ansatz, or self-citation load-bearing step. Self-citations, if any, are not required to support the evaluation methodology itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard subgroup fairness metrics (e.g., demographic parity or equalized odds) are appropriate for evaluating LLM output distributions across intersectional groups.
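A compact sketch of the two metrics this assumption names, written for binary outcomes over intersectional subgroups; the variable names and the reduction to a single max-gap summary are our own illustrative choices. It assumes every group contains both positive and negative gold labels.

```python
import numpy as np

def demographic_parity_gap(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Largest difference in positive-prediction rate between any two groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return float(max(rates) - min(rates))

def equalized_odds_gap(y_true: np.ndarray, y_pred: np.ndarray,
                       group: np.ndarray) -> float:
    """Largest gap in true-positive or false-positive rate across groups."""
    tprs, fprs = [], []
    for g in np.unique(group):
        m = group == g
        tprs.append(y_pred[m & (y_true == 1)].mean())
        fprs.append(y_pred[m & (y_true == 0)].mean())
    return float(max(max(tprs) - min(tprs), max(fprs) - min(fprs)))

# toy intersectional labels (e.g. race x gender pairs)
group = np.array(["BW", "BW", "WM", "WM", "BM", "BM"])
y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0])
print(demographic_parity_gap(y_pred, group))          # 1.0
print(equalized_odds_gap(y_true, y_pred, group))      # 1.0
```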