arxiv: 2604.08561 · v1 · submitted 2026-03-17 · 💻 cs.CL · cs.LG

A Representation-Level Assessment of Bias Mitigation in Foundation Models

Pith reviewed 2026-05-15 09:46 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords bias mitigation · foundation models · embedding space · gender bias · representational analysis · BERT · Llama2 · WinoDec

The pith

Bias mitigation reduces gender-occupation disparities in the embedding spaces of both encoder and decoder foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how bias mitigation changes the internal representations of foundation models by measuring associations between gender and occupation terms in their embedding spaces. It compares baseline and mitigated versions of BERT as an encoder-only model and Llama2 as a decoder-only model. The analysis finds that mitigation produces more neutral and balanced representations, with the same pattern appearing across both architectures. This positions embedding-space inspection as a direct way to audit whether debiasing has taken effect inside the model. The work also releases the WinoDec dataset of 4,000 sequences to enable similar checks on decoder-only models.

Core claim

By comparing baseline and bias-mitigated variants of BERT and Llama2, the analysis shows that bias mitigation reduces gender-occupation disparities in the embedding space and produces more neutral and balanced internal representations. These representational shifts appear consistently across both model types, indicating that fairness improvements can be observed as geometric transformations in how the models encode the relevant concepts.

What carries the argument

Measurement of shifts in embedding associations between gender and occupation terms before and after bias mitigation, using similarity metrics to track changes in representational geometry.
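The measurement described above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity as the similarity metric (the figure captions report cosine similarities) and a simple mean gap as the disparity summary. The function names `cosine_similarity` and `association_gap` and the toy two-dimensional vectors are hypothetical and not taken from the paper, which works with contextual embeddings from BERT and Llama2.

```python
import math
from statistics import mean


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def association_gap(male_vecs, female_vecs, occupation_vec):
    """Mean male-occupation similarity minus mean female-occupation similarity.

    Hypothetical summary statistic: a value near zero means both gender
    groups are, on average, equally close to the occupation embedding.
    """
    male_sims = [cosine_similarity(m, occupation_vec) for m in male_vecs]
    female_sims = [cosine_similarity(f, occupation_vec) for f in female_vecs]
    return mean(male_sims) - mean(female_sims)


# Toy 2-D vectors standing in for real contextual embeddings (assumption).
occupation = [1.0, 0.0]
male_terms = [[1.0, 0.0], [1.0, 1.0]]
female_terms = [[0.0, 1.0], [1.0, 1.0]]

gap = association_gap(male_terms, female_terms, occupation)
print(f"association gap: {gap:.4f}")
```

Computing the same gap for a baseline model and for its mitigated counterpart, on the same inputs, is one way to read off the kind of shift the paper describes.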

If this is right

  • Fairness gains from debiasing can be read directly from geometric changes inside the embedding space.
  • Embedding analysis offers an internal validation method for checking whether a debiasing technique has worked.
  • The same representational effect holds for both encoder-only models like BERT and decoder-only models like Llama2.
  • The released WinoDec dataset enables comparable internal audits on decoder-only architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If neutral embeddings reliably predict lower bias on downstream tasks, internal checks could replace some output-level evaluations.
  • The approach could be applied to other attributes such as race or age to test whether mitigation produces similar geometric patterns.
  • Consistency across model families suggests that debiasing may operate through a shared mechanism of association weakening rather than architecture-specific fixes.

Load-bearing premise

That the observed reductions in embedding associations are caused by the bias mitigation process itself rather than by unrelated training differences or the particular similarity metrics chosen.

What would settle it

Applying the same bias mitigation techniques to BERT or Llama2 and finding no measurable drop in the similarity between gender and occupation embeddings would show that the claimed representational shifts do not occur.
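One way such a check might be run is to compare the male and female distributions of gender-occupation cosine similarities with a two-sample Kolmogorov-Smirnov statistic, which the excerpt accompanying Figure 4 also reports. The sketch below is an illustration under stated assumptions: `ks_statistic` is a hypothetical helper written from the textbook definition of the statistic, and the similarity values are invented toy numbers, not results from the paper.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking
    # those points is enough to find the maximum gap.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Invented cosine similarities between gender terms and one occupation term.
baseline_male = [0.80, 0.90, 0.85, 0.95]
baseline_female = [0.40, 0.50, 0.45, 0.55]
mitigated_male = [0.60, 0.70, 0.65, 0.75]
mitigated_female = [0.60, 0.65, 0.70, 0.80]

d_baseline = ks_statistic(baseline_male, baseline_female)
d_mitigated = ks_statistic(mitigated_male, mitigated_female)

print(f"baseline D = {d_baseline:.2f}, mitigated D = {d_mitigated:.2f}")
if d_mitigated < d_baseline:
    print("The male and female distributions moved closer after mitigation.")
else:
    print("No drop in D: the claimed representational shift is not visible here.")
```

The statistic alone does not give a p-value; a library routine for the two-sample KS test would be needed for significance, and a falling D on real embeddings would still leave open the causal question raised under the load-bearing premise.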

Figures

Figures reproduced from arXiv: 2604.08561 by Brian Mac Namee, Elizabeth Daly, Rahul Nair, Svetoslav Nizhnichenkov.

Figure 1
Self-attention between gender-occupation term pairs for decoder-only models.
… the use of explanation methods to explore why bias manifests in LLM outputs [34,1]. While these studies provide valuable insights into the presence and mitigation of bias, they don’t provide the means for explaining how a successful bias-mitigation strategy manifests itself. This work extends the literature by shifting the focus …
Figure 2
An input sample from the real-world job shortlisting dataset with job description and candidate information.
… accuracy of 97%. To analyse gender-occupation association differences in the embedding space, we selected two stereotypical occupations, HR and Plumber, and extracted embeddings of the occupation and gender terms for each data sample. Additionally, we constructed a small synthetic dataset comprisi…
Figure 3
Sample instances from the synthetic dataset consisting of stereotypical occupations and gender terms.
… 4 Experiment Design. In this section, we describe how we utilised the models and data introduced in Section 3 to extract embeddings and analyse changes in embedding space resulting from bias mitigation. Our approach focuses on examining contextual embeddings for gender and occupation terms, enabling a deta…
Figure 4
Distributions of cosine similarities between embeddings of male gender terms and the “HR” occupation term and embeddings of female gender terms and the “HR” occupation for the baseline BERT model (left) and the bias-mitigated BERT model (right).
… with “female”. After bias mitigation, although the KS test still detects a statistically significant difference (D = 0.1290, p = 0.0003), the lower D-statistic va…
Figure 5
Distributions of cosine similarities between embeddings of male gender term and the job “plumber” and embeddings for female gender terms and “plumber” for the baseline BERT model (left) and the bias-mitigated BERT model (right).
Figure 6
Distributions of cosine similarities for embeddings of male and female gender terms split across stereotypically male (left) and female (right) occupations for the baseline and bias-mitigated BERT models.
… mitigation yields more equitable language representations. This has important implications (e.g., improved fairness in decision-making, increased trust in the system and better alignment with human values…
Figure 7
Distributions of cosine similarities for embeddings of male and female gender terms for configuration “Gender 2 - Occupation 2”, split across stereotypically male jobs for the baseline and bias-mitigated Llama2 models.
… terms. Specifically, female terms exhibit stronger alignment with stereotypically female occupations, reflecting a gendered semantic association embedded in the model. Following bias mitigat…
Figure 8
Distributions of cosine similarities for embeddings of male and female gender terms for configuration “Gender 1 - Occupation 2”, split across stereotypically male and female jobs for the baseline and bias-mitigated Llama2 models.
… Post-mitigation (top right plot), the distributions converge towards unimodality with reduced disparity across and within gender groups, suggesting more balanced semantic represe…
Original abstract

We investigate how successful bias mitigation reshapes the embedding space of encoder-only and decoder-only foundation models, offering an internal audit of model behaviour through representational analysis. Using BERT and Llama2 as representative architectures, we assess the shifts in associations between gender and occupation terms by comparing baseline and bias-mitigated variants of the models. Our findings show that bias mitigation reduces gender-occupation disparities in the embedding space, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, suggesting that fairness improvements can manifest as interpretable and geometric transformations. These results position embedding analysis as a valuable tool for understanding and validating the effectiveness of debiasing methods in foundation models. To further promote the assessment of decoder-only models, we introduce WinoDec, a dataset consisting of 4,000 sequences with gender and occupation terms, and release it to the general public. (https://github.com/winodec/wino-dec)

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper investigates how bias mitigation reshapes the embedding spaces of encoder-only (BERT) and decoder-only (Llama2) foundation models by comparing baseline and mitigated variants on gender-occupation associations. It claims that mitigation reduces disparities, produces more neutral representations, and yields consistent geometric shifts across architectures. The work also introduces and releases the WinoDec dataset of 4,000 sequences to support evaluation of decoder-only models.

Significance. If the central empirical claims are supported by quantitative metrics, statistical controls, and ablation evidence isolating the mitigation step, the representational analysis could provide a useful internal audit tool for validating debiasing methods. The public release of WinoDec would be a concrete community contribution for decoder-only fairness evaluation.

major comments (2)
  1. [Abstract] Abstract: the claim that 'bias mitigation reduces gender-occupation disparities' is stated without any quantitative metrics, similarity scores, statistical tests, or description of the embedding comparison procedure, preventing assessment of whether the data support the conclusion.
  2. [Introduction] Introduction and methods (as referenced in the abstract): the before-and-after comparison of baseline and bias-mitigated BERT/Llama2 variants does not establish that the models differ solely in the mitigation step. No statement of matched pre-training conditions, data subsets, or hyperparameter controls is provided, leaving open the possibility that observed embedding shifts arise from unrelated training differences rather than the debiasing technique itself.
minor comments (1)
  1. [Abstract] The GitHub link for WinoDec is given but no details on dataset construction, annotation process, or baseline statistics are supplied in the abstract or introduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help us improve the clarity and rigor of our manuscript. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'bias mitigation reduces gender-occupation disparities' is stated without any quantitative metrics, similarity scores, statistical tests, or description of the embedding comparison procedure, preventing assessment of whether the data support the conclusion.

    Authors: We agree with this observation. The abstract in the current version is concise but lacks the supporting quantitative details. In the revised manuscript, we will expand the abstract to include key quantitative metrics (such as cosine similarity differences and statistical significance tests) and a brief description of the embedding comparison procedure. These elements are detailed in Sections 3 and 4 of the paper and will be summarized upfront.

    Revision: yes

  2. Referee: [Introduction] Introduction and methods (as referenced in the abstract): the before-and-after comparison of baseline and bias-mitigated BERT/Llama2 variants does not establish that the models differ solely in the mitigation step. No statement of matched pre-training conditions, data subsets, or hyperparameter controls is provided, leaving open the possibility that observed embedding shifts arise from unrelated training differences rather than the debiasing technique itself.

    Authors: This is a valid concern regarding causal attribution. The bias-mitigated variants we use are established models from prior work (e.g., debiased BERT variants and Llama2 with fairness fine-tuning), where the mitigation is the primary difference from the baseline. To strengthen this, we will add explicit details in the Methods section about the specific model checkpoints used, their training histories as documented in the literature, and any controls applied. We will also include a discussion of potential confounding factors as a limitation if full matching is not feasible.

    Revision: partial

Circularity Check

0 steps flagged

Empirical embedding comparison contains no circular derivations

The paper performs a direct before-and-after comparison of gender-occupation associations in the embedding spaces of baseline versus bias-mitigated BERT and Llama2 models. No equations, fitted parameters, or derivations are present that would reduce the reported representational shifts to self-referential definitions or inputs by construction. The analysis relies on empirical measurement of embedding similarities, the introduction of the WinoDec dataset is an independent contribution, and no self-citation chains or uniqueness theorems are invoked as load-bearing premises. The central claim therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that embedding-space distances or similarities reliably capture gender-occupation bias and that mitigation techniques produce measurable geometric changes in those spaces.

axioms (1)
  • Domain assumption: embedding spaces capture semantic associations relevant to bias.
    Invoked when the paper treats shifts in gender-occupation term associations as evidence of successful debiasing.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

  [1] Amara, K., Sevastjanova, R., El-Assady, M.: Concept-level explainability for auditing & steering LLM responses. arXiv (2025)

  [2] Bolukbasi, T., Chang, K., Zou, J.Y., Saligrama, V., Kalai, A.T.: Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In: Advances in Neural Information Processing Systems (2016)

  [3] Caliskan, A., Bryson, J.J., Narayanan, A.: Semantics derived automatically from language corpora contain human-like biases. Science (2017)

  [4] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Association for Computational Linguistics (2019)

  [5] Gallegos, I.O., Rossi, R.A., Barrow, J., Tanjim, M.M., Kim, S., Dernoncourt, F., Yu, T., Zhang, R., Ahmed, N.K.: Bias and fairness in large language models: A survey. Comput. Linguistics (2024)

  [6] Gibbons, J.D., Chakraborti, S.: Nonparametric statistical inference. In: International Encyclopedia of Statistical Science. Springer (2011)

  [7] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

  [8] Jin, X., Barbieri, F., Kennedy, B., Davani, A.M., Neves, L., Ren, X.: On transferability of bias mitigation effects in language model fine-tuning. In: Association for Computational Linguistics (2021)

  [9] Joniak, P.K., Aizawa, A.: Gender biases and where to find them: Exploring gender bias in pre-trained transformer-based language models using movement pruning. arXiv (2022)

  [10] Kaneko, M., Bollegala, D., Okazaki, N.: The impact of debiasing on the performance of language models in downstream tasks is underestimated. In: Association for Computational Linguistics (2023)

  [11] Karypis, M.S.G., Kumar, V., Steinbach, M.: A comparison of document clustering techniques. In: TextMining Workshop at KDD2000 (2000)

  [12] Lu, K., Mardziel, P., Wu, F., Amancharla, P., Datta, A.: Gender bias in neural natural language processing. In: Logic, Language, and Security - Essays Dedicated to Andre Scedrov on the Occasion of His 65th Birthday. Springer (2020)

  [13] Maudslay, R.H., Gonen, H., Cotterell, R., Teufel, S.: It’s all in the name: Mitigating gender bias with name-based counterfactual data substitution. Association for Computational Linguistics (2019)

  [14] May, C., Wang, A., Bordia, S., Bowman, S.R., Rudinger, R.: On measuring social biases in sentence encoders. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (2019)

  [15] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Comput. Surv. (2022)

  [16] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations (2013)

  [17] Mohammadi, B.: Creativity has left the chat: The price of debiasing language models. arXiv (2024)

  [18] Nadeem, M., Bethke, A., Reddy, S.: StereoSet: Measuring stereotypical bias in pretrained language models. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (2021)

  [19] Nangia, N., Vania, C., Bhalerao, R., Bowman, S.R.: CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (2020)

  [20] Nomelini, G.G., Marcolin, C.B.: Gender bias in large language models: A job postings analysis. RAM. Revista de Administração Mackenzie (2024)

  [21] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P.F., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (2022)

  [22] Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P.M., Bowman, S.R.: BBQ: A hand-built bias benchmark for question answering. In: Findings of the Association for Computational Linguistics (2022)

  [23] Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014)

  [24] Peters, M.E., Neumann, M., Zettlemoyer, L., Yih, W.t.: Dissecting contextual word embeddings: Architecture and representation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018)

  [25] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

  [26] Rakivnenko, V., Maslej, N., Cervi, J., Zhukov, V.: Bias in text embedding models. arXiv (2024)

  [27] Rudinger, R., Naradowsky, J., Leonard, B., Durme, B.V.: Gender bias in coreference resolution. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (2018)

  [28] Tokpo, E.K., Calders, T.: Text style transfer for bias mitigation using masked language modeling. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (2022)

  [29] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton-Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., ... arXiv (2023)

  [30] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems (2017)

  [31] Wang, T., Lin, X.V., Rajani, N.F., McCann, B., Ordonez, V., Xiong, C.: Double-hard debias: Tailoring word embeddings for gender bias mitigation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020)

  [32] Webster, K., Wang, X., Tenney, I., Beutel, A., Pitler, E., Pavlick, E., Chen, J., Petrov, S.: Measuring and reducing gendered correlations in pre-trained models. arXiv (2020)

  [33] Wilson, K., Caliskan, A.: Gender, race, and intersectional bias in resume screening via language model retrieval. In: Proceedings of the Seventh AAAI/ACM Conference on AI, Ethics, and Society (2024)

  [34] Wu, Z., Bulathwela, S., Perez-Ortiz, M., Koshiyama, A.S.: Stereotype detection in LLMs: A multiclass, explainable, and benchmark-driven approach. arXiv (2024)

  [35] Yu, C., Jeoung, S., Kasi, A., Yu, P., Ji, H.: Unlearning bias in language models by partitioning gradients. In: Findings of the Association for Computational Linguistics (2023)

  [36] Zhao, J., Wang, T., Yatskar, M., Cotterell, R., Ordonez, V., Chang, K.: Gender bias in contextualized word embeddings. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (2019)

  [37] Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.: Gender bias in coreference resolution: Evaluation and debiasing methods. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (2018)