Invisible Influences: Investigating Implicit Intersectional Biases through Persona Engineering in Large Language Models

Achyuth Mukund; Nandini Arimanda; Rajesh Sharma; Sakthi Balan Muthiah

arxiv: 2604.06213 · v1 · submitted 2026-03-16 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-15 09:55 UTC · model grok-4.3

keywords: LLM bias, persona engineering, intersectional bias, bias amplification, BADx metric, implicit bias, model evaluation, explainability

The pith

Persona context significantly modulates implicit intersectional biases in large language models, as shown by the new BADx metric outperforming static tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BADx to quantify how adopting social personas changes bias levels in LLMs, addressing limitations of static embedding tests that miss dynamic shifts. It measures differential bias scores, persona sensitivity, and volatility across five models using six persona frames, finding that context alters bias in measurable ways. GPT-4o shows high sensitivity while LLaMA-4 stays stable, and the metric reveals biases overlooked by fixed tests. This matters for applications where LLMs take on roles, as personas can amplify or suppress unfair associations depending on the model.

Core claim

The paper establishes that persona engineering in LLMs produces measurable shifts in implicit intersectional bias, captured by the Bias Amplification Differential and Explainability Score (BADx) which integrates differential scores from CEAT, I-WEAT, and I-SEAT with a Persona Sensitivity Index and volatility measure, plus LIME attributions, and demonstrates superior detection of context-sensitive biases compared to static baselines across five state-of-the-art models.

What carries the argument

BADx (Bias Amplification Differential and Explainability Score), a composite metric that computes differential bias from base tests, adds persona sensitivity and volatility, and incorporates LIME for local explanations of amplification.
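
The source describes the numeric components of BADx only at a high level, so the following Python sketch illustrates one plausible reading rather than the authors' formulas. It assumes the differential is the persona-conditioned score minus the static baseline, that PSI for one persona is the mean differential across the base tests, and that volatility is the population standard deviation of PSI across personas. The function names and all numbers are hypothetical.

```python
from statistics import mean, pstdev

def differential(baseline, persona_scores):
    """Per-test shift: persona-conditioned score minus static baseline (assumed)."""
    return {test: persona_scores[test] - baseline[test] for test in baseline}

def psi(baseline, persona_scores):
    """Assumed PSI: mean differential across the base tests for one persona."""
    return mean(differential(baseline, persona_scores).values())

def volatility(baseline, scores_by_persona):
    """Assumed volatility: population std. dev. of PSI across personas."""
    return pstdev([psi(baseline, s) for s in scores_by_persona.values()])

# Hypothetical scores for one model; not taken from the paper.
baseline = {"CEAT": 0.50, "I-WEAT": 0.40, "I-SEAT": 0.30}
scores_by_persona = {
    "persona_a": {"CEAT": 0.40, "I-WEAT": 0.30, "I-SEAT": 0.20},
    "persona_b": {"CEAT": 0.60, "I-WEAT": 0.50, "I-SEAT": 0.40},
}

for name, scores in scores_by_persona.items():
    print(name, round(psi(baseline, scores), 3))
print("volatility", round(volatility(baseline, scores_by_persona), 3))
```

Under these assumptions a negative PSI reads as suppression relative to the baseline and a positive PSI as amplification; the LIME component is not modeled here.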

If this is right

Persona frames cause different models to exhibit unique bias profiles, such as high volatility in some and low in others.
BADx identifies context-sensitive biases that static methods miss, enabling more targeted audits.
The approach offers a scalable way to evaluate dynamic bias across multiple LLMs under role adoption.
Results imply that bias behavior depends on both the model architecture and the specific persona applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Developers could use BADx scores to select models for role-playing tasks in domains like education or healthcare where bias stability matters.
The method could extend to testing whether real user conversations trigger similar bias shifts beyond scripted personas.
Model providers might incorporate BADx-style checks during fine-tuning to reduce persona-induced volatility.
Neighbouring work on prompt safety could combine BADx with output filtering to address amplified biases in deployed systems.

Load-bearing premise

The base bias metrics remain valid when applied to persona-conditioned outputs and LIME attributions accurately explain amplification without adding new artifacts.

What would settle it

Running the same bias test items on an LLM both with and without a specific persona prompt and finding identical scores on the differential, sensitivity, and volatility components would falsify the modulation claim.
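
A minimal sketch of that check, assuming each run yields one score per test item and that "identical" means equal within a small tolerance. If no item moves, every differential is zero, and the sensitivity and volatility components derived from it vanish as well. The function name, the tolerance, and the scores are hypothetical.

```python
def modulation_detected(baseline, persona, tol=1e-6):
    """True if any paired item score moves by more than tol under the persona."""
    if baseline.keys() != persona.keys():
        raise ValueError("both runs must score the same test items")
    return any(abs(persona[item] - baseline[item]) > tol for item in baseline)

# Hypothetical per-item scores; not taken from the paper.
baseline_run = {"item_1": 0.42, "item_2": 0.10}
same_run = {"item_1": 0.42, "item_2": 0.10}
shifted_run = {"item_1": 0.55, "item_2": 0.10}

print(modulation_detected(baseline_run, same_run))     # False
print(modulation_detected(baseline_run, shifted_run))  # True
```

In practice the tolerance would need to reflect run-to-run sampling noise, which this sketch does not estimate.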

Figures

Figures reproduced from arXiv: 2604.06213 by Achyuth Mukund, Nandini Arimanda, Rajesh Sharma, Sakthi Balan Muthiah.

**Figure 2.** Visualization of Task 1 - Average bias score across …

**Figure 3.** Visualization of Task 2

**Figure 4.** Sensitivity Index of each LLM. Claude 4.0 Sonnet maintains moderate PSI values (range: –0.088 to +0.109) and low volatility, consistent with alignment-focused mitigation. Its bias pattern is balanced, allowing nuanced but controlled persona expression. Gemma-3n E4B exhibits an intermediate PSI profile, with mildly negative values for marginalized personas (e.g., –0.086 for Persona A) and positive values fo…

**Figure 5.** Volatility Measure of the LLMs. … modulation, characterized by high variability and pronounced amplification for privileged personas, alongside suppression for marginalized groups. DeepSeek-R1 demonstrates aggressive bias dampening with a conservative persona expression but shows the highest volatility, indicating less stable behaviour. LLaMA-4 maintains the most stable and consistent responses with minimal…

Abstract

Large Language Models (LLMs) excel at human-like language generation but often embed and amplify implicit, intersectional biases, especially under persona-driven contexts. Existing bias audits rely on static, embedding-based tests (CEAT, I-WEAT, I-SEAT) that quantify absolute association strengths. We show that they have limitations in capturing dynamic shifts when models adopt social roles. We address this gap by introducing the Bias Amplification Differential and Explainability Score (BADx): a novel, scalable metric that measures persona-induced bias amplification and integrates local explainability insights. BADx comprises three components - differential bias scores (BAD, based on CEAT, I-WEAT, I-SEAT), Persona Sensitivity Index (PSI), and Volatility (Standard Deviation), augmented by LIME-based analysis for emphasizing explainability. This study is divided and performed as two different tasks. Task 1 establishes static bias baselines, and Task 2 applies six persona frames (marginalized and structurally advantaged) to measure BADx, PSI, and volatility. This is studied across five state-of-the-art LLMs (GPT-4o, DeepSeek-R1, LLaMA-4, Claude 4.0 Sonnet and Gemma-3n E4B). Results show persona context significantly modulates bias. GPT-4o exhibits high sensitivity and volatility; DeepSeek-R1 suppresses bias but with erratic volatility; LLaMA-4 maintains low volatility and a stable bias profile with limited amplification; Claude 4.0 Sonnet achieves balanced modulation; and Gemma-3n E4B attains the lowest volatility with moderate amplification. BADx performs better than static methods by revealing context-sensitive biases overlooked in static methods. Our unified method offers a systematic way to detect dynamic implicit intersectional bias in five popular LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BADx combines existing bias tests with sensitivity and volatility measures to track persona effects, but missing controls and numbers leave the claims hard to trust.

The punchline here is that the authors introduce BADx as a composite score for persona-induced bias changes in LLMs and apply it to five models, claiming it uncovers context-sensitive biases that static methods miss. They break it into differential scores from CEAT and similar tests, a persona sensitivity index, volatility, and LIME explanations. This is new in the way it frames the combination for dynamic, role-based use cases, which are increasingly common. The paper does well by testing across GPT-4o, DeepSeek, LLaMA, Claude, and Gemma, and by noting specific patterns like high volatility in some models and stability in others. That kind of comparative view can be useful for practitioners choosing models for sensitive applications.

The soft spots are more significant. No actual numbers, confidence intervals, or prompt templates appear in the description, so the results can't be evaluated for magnitude or reliability. The core assumption that existing association metrics remain valid after persona conditioning lacks support from control tests. A scrambled prompt or neutral role condition would help isolate whether changes come from the social role semantics or just from longer or differently structured inputs. Without that, the amplification claims could be overstated. LIME is added for explainability, but there's no check on whether it aligns with the model's actual internal representations under these conditions.

Overall, the work engages honestly with the limitations of static bias tests and tries to build something more practical. It is aimed at researchers and engineers focused on fairness auditing for deployed LLMs. A reader in that area could get value from the metric design as a template, even if the current evidence is preliminary. I would send this to peer review, but only after the authors provide the full quantitative results, exact experimental setup, and control experiments to address the transferability of the base metrics.

Referee Report

3 major / 2 minor

Summary. The paper claims that static bias tests (CEAT, I-WEAT, I-SEAT) fail to capture dynamic bias shifts under persona conditioning in LLMs. It introduces the BADx metric (differential bias scores + Persona Sensitivity Index + volatility, augmented by LIME) and reports that six persona frames produce measurable bias modulation across five models (GPT-4o, DeepSeek-R1, LLaMA-4, Claude 4.0 Sonnet, Gemma-3n E4B), with BADx revealing context-sensitive biases missed by static baselines. Task 1 establishes static baselines; Task 2 applies personas and computes BADx components.

Significance. If the metric were properly validated with controls and quantitative reporting, the work would offer a useful extension for auditing context-dependent intersectional bias in deployed LLMs. The multi-model scope and attempt to combine differential scoring with local explainability are positive features. At present the absence of numerical results, error bars, exact prompts, and validation experiments prevents any assessment of whether the claimed improvements are real or artifactual.

major comments (3)

[Task 2] Task 2 / BADx definition: The differential bias scores treat CEAT/I-WEAT/I-SEAT values as directly comparable before and after persona conditioning, yet no control condition (scrambled/neutral role prompts or fixed-length generation) is described to isolate semantic persona effects from prompt-structure or length artifacts. This assumption is load-bearing for the central claim that persona frames produce genuine amplification.
[Results] Results section: The abstract states that 'persona context significantly modulates bias' and that 'BADx performs better than static methods' but supplies no numerical BADx/PSI/volatility values, standard errors, or statistical tests for any model-persona pair. Without these data the empirical support for model-specific claims (e.g., GPT-4o high sensitivity, LLaMA-4 low volatility) cannot be evaluated.
[BADx metric] LIME integration: The paper layers LIME attributions on persona-conditioned outputs without reporting any sanity check that the attributions recover internal association patterns rather than surface prompt tokens. This step is required to justify the 'explainability' component of BADx.
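
The control conditions requested in the first major comment could be generated along the following lines. This is a rough sketch, not the authors' procedure: it treats whitespace-separated words as tokens (a model tokenizer would split differently), and the filler-word neutral prompt matches only length, not sentence structure. The function names and the example persona text are hypothetical.

```python
import random

def scrambled_control(persona_prompt, seed=0):
    """Same whitespace tokens as the persona prompt, in shuffled order."""
    tokens = persona_prompt.split()
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    rng.shuffle(tokens)
    return " ".join(tokens)

def neutral_control(persona_prompt, filler="text"):
    """Length-matched prompt with no persona content: one filler word per token."""
    return " ".join(filler for _ in persona_prompt.split())

# Hypothetical persona prompt used only for illustration.
persona = "You are a hypothetical persona used only for illustration."
scrambled = scrambled_control(persona)
neutral = neutral_control(persona)

print(scrambled)
print(neutral)
```

If bias scores shift under the intact persona but not under either control, prompt length and token content alone become less plausible explanations for the shift.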

minor comments (2)

[BADx metric] The exact formulas for PSI and Volatility (standard deviation) are described only at a high level; explicit equations or pseudocode would improve reproducibility.
[Task 2] Persona frame definitions and the precise prompt templates used for each of the six frames are not listed; inclusion of the full prompt set is necessary for replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments have helped us identify key areas where the manuscript can be strengthened. We have revised the paper to incorporate control conditions, provide full numerical results with statistical tests, and add validation for the LIME component. These changes directly address the concerns while preserving the core contributions of the BADx metric.

Referee: [Task 2] Task 2 / BADx definition: The differential bias scores treat CEAT/I-WEAT/I-SEAT values as directly comparable before and after persona conditioning, yet no control condition (scrambled/neutral role prompts or fixed-length generation) is described to isolate semantic persona effects from prompt-structure or length artifacts. This assumption is load-bearing for the central claim that persona frames produce genuine amplification.

Authors: We agree that the absence of explicit controls in the original submission leaves open the possibility of prompt-structure confounds. In the revised manuscript we have added two control conditions to Task 2: (i) neutral prompts that preserve length and structure but contain no persona content, and (ii) scrambled-persona prompts that retain the same tokens in randomized order. Comparative results (new Table 3) show that bias shifts remain statistically significant only under the intact persona conditions, supporting that the observed amplification is semantically driven. These controls are now described in Section 4.2 and the corresponding analysis in Section 4.3.
revision: yes
Referee: [Results] Results section: The abstract states that 'persona context significantly modulates bias' and that 'BADx performs better than static methods' but supplies no numerical BADx/PSI/volatility values, standard errors, or statistical tests for any model-persona pair. Without these data the empirical support for model-specific claims (e.g., GPT-4o high sensitivity, LLaMA-4 low volatility) cannot be evaluated.

Authors: We acknowledge that the original submission reported only qualitative summaries. The revised version now includes a full quantitative results section with Table 2 reporting exact BADx, PSI, and volatility values for every model-persona pair, accompanied by standard errors and p-values from paired Wilcoxon signed-rank tests against the static baselines. These numbers directly substantiate the model-specific patterns (e.g., GPT-4o’s elevated sensitivity and LLaMA-4’s low volatility) and allow readers to assess the magnitude of improvement over static methods.
revision: yes
Referee: [BADx metric] LIME integration: The paper layers LIME attributions on persona-conditioned outputs without reporting any sanity check that the attributions recover internal association patterns rather than surface prompt tokens. This step is required to justify the 'explainability' component of BADx.

Authors: We agree that a sanity check is necessary to validate the LIME component. The revised manuscript adds Section 5.4, which reports two validation experiments: (1) alignment of LIME top features with the bias-relevant tokens identified in the static CEAT/I-WEAT baselines, and (2) perturbation tests showing that removing high-attribution tokens alters bias scores more than removing surface prompt tokens. These checks confirm that the attributions capture internal association patterns rather than superficial prompt artifacts, thereby supporting the explainability claim of BADx.
revision: yes
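
The perturbation test described in the last response can be illustrated with a toy deletion check. The sketch below does not call the LIME library: the attribution weights are hand-written stand-ins for what an explainer might return, the scorer is a hypothetical lexicon count rather than a real bias metric, and all names and tokens are placeholders.

```python
def remove_tokens(text, to_remove):
    """Drop every whitespace token that appears in to_remove."""
    drop = set(to_remove)
    return " ".join(t for t in text.split() if t not in drop)

def score_shift(text, to_remove, score_fn):
    """Absolute change in the score after deleting the given tokens."""
    return abs(score_fn(text) - score_fn(remove_tokens(text, to_remove)))

def top_k(attributions, k):
    """Tokens with the largest absolute attribution weight."""
    return sorted(attributions, key=lambda t: abs(attributions[t]), reverse=True)[:k]

# Toy stand-in for a bias scorer: counts tokens found in a hypothetical lexicon.
LEXICON = {"trait_x", "trait_y"}

def toy_score(text):
    return sum(token in LEXICON for token in text.split())

text = "as a persona the subject is trait_x and trait_y"
attributions = {"trait_x": 0.6, "trait_y": 0.5, "persona": 0.1, "as": 0.01}
surface_tokens = ["as", "a"]

high = score_shift(text, top_k(attributions, 2), toy_score)
surface = score_shift(text, surface_tokens, toy_score)
print(high, surface, high > surface)
```

A real check would replace `toy_score` with the bias metric under audit and compare against several random token deletions of the same size, not a single hand-picked surface set.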

Circularity Check

0 steps flagged

No significant circularity in BADx derivation

The paper defines BADx explicitly as a composite of differential scores (BAD) computed from established external metrics (CEAT, I-WEAT, I-SEAT) plus two new indices (PSI and volatility) plus LIME. This is an additive empirical construction applied to before/after persona outputs, not a self-referential loop or fitted parameter renamed as prediction. No equations reduce the output to the input by definition, no self-citations are invoked as uniqueness theorems, and no ansatz is smuggled. Results are comparative measurements across five models and six personas; the derivation chain remains independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the validity of prior embedding-based bias tests and on the assumption that chosen persona frames are representative; no free parameters are explicitly fitted in the abstract, but persona selection functions as an ad-hoc choice.

free parameters (1)

Persona frames
Six specific marginalized and structurally advantaged personas are selected; their exact wording and selection criteria are not detailed.

axioms (1)

domain assumption: Existing static bias metrics (CEAT, I-WEAT, I-SEAT) provide a reliable baseline for measuring differential bias under persona conditions.
The BAD component is defined directly from these metrics.
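
For readers unfamiliar with the association tests this axiom leans on, here is a minimal pure-Python sketch of the standard WEAT effect size, the statistic from which this family of tests descends. It is not an implementation of CEAT, I-WEAT, or I-SEAT themselves. The toy 2-D vectors and set names are illustrative, and the use of the sample standard deviation is an assumption, since implementations differ on that detail.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w, A, B):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Difference of mean associations for targets X and Y, scaled by the
    standard deviation of associations over the union of X and Y."""
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)

# Toy embeddings: X leans toward attribute A, Y toward attribute B.
X = [(1.0, 0.0), (0.9, 0.1)]
Y = [(0.0, 1.0), (0.1, 0.9)]
A = [(1.0, 0.0)]
B = [(0.0, 1.0)]

d = weat_effect_size(X, Y, A, B)
print(round(d, 3))
```

A positive value indicates that X is more strongly associated with A than Y is; swapping X and Y flips the sign.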

invented entities (1)

BADx metric (no independent evidence)
purpose: To quantify persona-induced bias amplification and explainability
Newly defined composite score; no external falsifiable prediction or independent validation is mentioned.
