arxiv: 2603.24125 · v2 · pith:GAEAA7U3new · submitted 2026-03-25 · 💻 cs.CL

Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Pith reviewed 2026-05-15 00:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords: gender bias · large language models · alignment · internal representations · fine-tuning · adversarial prompting · expressed bias

The pith

Alignment reduces expressed gender bias in outputs but leaves measurable associations intact inside the model's representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a single set of neutral prompts to measure both the gender information stored in a model's internal states and the bias appearing in its generated text. This protocol reveals a steady link between the two, unlike earlier studies that found weak or varying connections. When models undergo supervised fine-tuning to lessen bias, the bias drops in ordinary outputs yet the internal associations stay detectable and can be triggered again by adversarial prompts. The work also shows that bias reductions seen on fixed benchmarks often fail to carry over to open-ended tasks such as story generation.

Core claim

The central claim is that supervised fine-tuning aimed at gender-bias reduction lowers bias in generated outputs while gender-related associations remain present in the model's internal representations and can be reactivated under adversarial prompting; debiasing effects observed on structured benchmarks do not necessarily generalize to realistic settings such as story generation.

What carries the argument

A unified measurement framework that applies the same neutral prompts to extract latent gender information from internal representations and to quantify bias in generated outputs.
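The comparison step of such a framework can be sketched as follows, assuming each entity already has one latent score and one expressed-bias score obtained from the same neutral prompts. The figure captions name Spearman correlation as the statistic. The entity names, the numbers, and the helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical and are not taken from the paper.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up illustrative scores, one per entity, both measured on the same prompts.
latent_score = {"nurse": -0.8, "engineer": 0.7, "teacher": -0.3, "pilot": 0.5, "chef": 0.1}
expressed_bias = {"nurse": -0.9, "engineer": 0.6, "teacher": -0.1, "pilot": 0.8, "chef": 0.0}

entities = sorted(latent_score)
rho = spearman([latent_score[e] for e in entities],
               [expressed_bias[e] for e in entities])
print(f"Spearman rho over {len(entities)} entities: {rho:.3f}")
```

Keying both dictionaries by the same entities mirrors the point of the unified protocol: the two measurements are paired by construction, so the correlation is not confounded by differing prompt sets.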

If this is right

  • Alignment through fine-tuning reliably lowers bias in standard generated responses.
  • Internal gender associations survive alignment and remain accessible.
  • Adversarial prompting can restore the bias that alignment had suppressed.
  • Bias reductions measured on fixed benchmarks do not automatically appear in open-ended generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluations of alignment success may need to include checks for internal representations in addition to output tests.
  • True removal of encoded associations could require architectural changes beyond output-level fine-tuning.
  • Safety testing for deployed models should incorporate adversarial prompt suites to detect hidden associations.

Load-bearing premise

The neutral prompts and extraction methods chosen for internal representations accurately reflect the model's genuine encoded gender associations rather than artifacts of prompt wording or measurement technique.

What would settle it

A direct test in which adversarial prompts applied after alignment produce no measurable increase in expressed gender bias, or in which internal extraction methods detect no remaining gender associations.

Figures

Figures reproduced from arXiv: 2603.24125 by Christophe Marsala, Marcin Detyniecki, Marie-Jeanne Lesot, Nour Bouchouchi, Thibault Laugel, Xavier Renard.

Figure 1: Distribution of generated genders (female/male/neutral) and entity-level …
Figure 2: Concept-level polarization score Bias_pol(c) for 6 concepts studied across 3 models (gemma, Llama and Mistral from left to right) and 3 conditions (before fine-tuning, after fine-tuning, and after fine-tuning with a jailbreak instruction).
  …changeable neutral personas P = {My friend, Someone I know, This person, A person, An individual, A person I met}. Generation and Annotation: For each prompt x_{p,e}, we gene…
Figure 3: Entity-level latent gender score s^20(e) for Llama, before and after fine-tuning for the concepts Professions (top) and Diseases (bottom).
  [Plot: x-axis Layer (0 to 30); y-axis S^l_latent; panels Before FT, After FT; legend Professions, Sports, Months, Colors, Languages, Diseases, each with a Random reference]
Figure 4: Latent polarization score S^l_latent(c) per concept across layers for Llama (before and after fine-tuning), compared to concept-specific random reference distributions (shaded areas indicate the 2.5%-97.5% quantile interval).
  4.4 Is there a Relationship Between Intrinsic and Extrinsic Bias? We now examine the relationship between intrinsic and extrinsic bias through two analyses: correlation, to measure con…
Figure 5: Spearman correlation between expressed bias and latent gender scores by …
Figure 6: Concept-level polarization score Bias_pol(c) for Llama: before fine-tuning, before fine-tuning with direction ablation, after fine-tuning with jailbreak instruction, and after fine-tuning with jailbreak instruction and direction ablation.
  5 Bias in Realistic Generation Tasks. The previous experiments rely on structured prompt completion, which provides controlled conditions for measuring bias but may not re…
Figure 7: Results on the unified framework to the two realistic generation tasks
Figure 8: Prompt used for automatic gender annotation.
Figure 9: Concept-level polarization score Bias_pol(c) for the 6 concepts studied across 3 models (gemma on the left, Llama in the middle, and Mistral on the right) and 3 conditions (initial model, after fine-tuning, and after fine-tuning with a jailbreak instruction).
  [Plot: x-axis categories ESL, Stories; y-axis Bias_pol (0.0 to 0.7); legend Method: Before FT, After FT, After FT + Jailbreak instruction]
Figure 10: Concept-level polarization score Bias_pol(c) for the 2 realistic tasks studied across 3 models (gemma on the left, Llama in the middle, and Mistral on the right) and 3 conditions (initial model, after fine-tuning, and after fine-tuning with a jailbreak instruction).
  B.3 Relationship between Extrinsic and Intrinsic Bias. Correlation: Figures 13 and 14 show the correlations between output bias and latent score…
Figure 11: S^l_latent(c) for 6 concepts across layers, for 3 models (before and after fine-tuning), compared to the random distributions specific to each concept
Figure 12: S^l_latent(c) for 2 realistic tasks across layers, for 3 models (before and after fine-tuning), compared to the random distributions specific to each concept
Figure 13: Spearman correlation between output bias and latent scores across layer, …
Figure 14: Spearman correlation between output bias and latent scores across layer, …
Figure 15: Concept-level polarization score Bias_pol(c) for 6 concepts across layers for 3 models under 4 conditions: initial model, initial model with directional ablation, fine-tuned model with jailbreak instruction, and fine-tuned model with jailbreak instructions and directional ablation
Figure 16: Concept-level polarization score Bias_pol(c) for 2 realistic tasks across layers for 3 models under 4 conditions: initial model, initial model with directional ablation, fine-tuned model with jailbreak instruction, and fine-tuned model with jailbreak instructions and directional ablation
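
Several of the captions above refer to direction ablation without saying how it is computed. A common formulation, used for example in the Arditi et al. work listed under the references, removes the component of a hidden state along a unit direction: h' = h - (h · d) d. The sketch below assumes that formulation. Whether the paper applies it in exactly this way is an assumption, and the vectors and function names are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def ablate_direction(hidden, direction):
    """Project `hidden` onto the subspace orthogonal to `direction`."""
    d = unit(direction)
    coeff = dot(hidden, d)  # how much of the hidden state lies along d
    return [h - coeff * di for h, di in zip(hidden, d)]

# Toy 3-dimensional hidden state and a hypothetical "gender direction".
hidden = [2.0, -1.0, 0.5]
gender_direction = [1.0, 1.0, 0.0]

ablated = ablate_direction(hidden, gender_direction)
print("component along direction after ablation:",
      dot(ablated, unit(gender_direction)))
```

Because the operation is a projection, applying it twice gives the same result as applying it once, and components orthogonal to the direction are left unchanged.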
Original abstract

During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes a unified framework that applies identical neutral prompts to jointly measure intrinsic gender bias (latent associations extracted from internal representations) and extrinsic gender bias (generated outputs) in LLMs. It reports a consistent correlation between the two, in contrast to the weak or inconsistent correlations found in prior work. It further shows that alignment via supervised fine-tuning reduces expressed bias while leaving measurable encoded associations intact and reactivatable by adversarial prompts, and that debiasing gains on structured benchmarks do not necessarily generalize to realistic tasks such as story generation.

Significance. If the measurements prove robust, the work is significant for establishing a more ecologically valid protocol than structured benchmarks alone and for showing that alignment techniques may suppress surface-level outputs without altering underlying representations. The unified comparison, reactivation results, and generalization failure provide concrete guidance for future debiasing research and highlight limitations in current evaluation practices.

major comments (3)
  1. [§4.1] §4.1 (Representation extraction): The method for isolating latent gender information from hidden states on neutral prompts requires explicit specification of the probe architecture, layer selection, and any statistical controls for prompt sensitivity; without these, the claim that encoded associations persist post-alignment and are not measurement artifacts cannot be fully evaluated, as this extraction is load-bearing for the central contrast with prior work.
  2. [§5.2] §5.2 (Adversarial reactivation): The adversarial prompts used to reactivate gender associations after SFT must be shown to remain within the neutral category or accompanied by controls demonstrating that reactivation reflects pre-existing encodings rather than prompt-induced amplification; this directly affects the interpretation that alignment leaves encodings unchanged.
  3. [§4] Results tables (e.g., correlation tables in §4): The reported consistent associations should include direct head-to-head comparisons with the specific methods and prompt sets from prior studies that found weak correlations, along with effect sizes and robustness across multiple random seeds, to substantiate the discrepancy.
minor comments (3)
  1. The full set of neutral prompts and adversarial variants should be included in an appendix to support reproducibility of the unified protocol.
  2. [Figure 3] Figure captions for reactivation plots should explicitly state the number of samples and any error bars or confidence intervals used.
  3. [§3.1] Clarify the exact definition of 'neutral' prompts in §3.1 to avoid ambiguity in how they differ from structured benchmark prompts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas for improving methodological transparency and empirical robustness. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§4.1] §4.1 (Representation extraction): The method for isolating latent gender information from hidden states on neutral prompts requires explicit specification of the probe architecture, layer selection, and any statistical controls for prompt sensitivity; without these, the claim that encoded associations persist post-alignment and are not measurement artifacts cannot be fully evaluated, as this extraction is load-bearing for the central contrast with prior work.

    Authors: We agree that greater specificity on the representation extraction procedure is warranted to support reproducibility and evaluation of our central claims. In the revised manuscript, we will expand §4.1 with a dedicated paragraph specifying the probe as an L2-regularized logistic regression classifier, layer selection via cross-validated accuracy maximization across layers (with the final layer typically optimal), and statistical controls including label-permutation tests on neutral prompts to quantify and rule out prompt-sensitivity artifacts. These details will directly bolster the contrast with prior work. revision: yes

  2. Referee: [§5.2] §5.2 (Adversarial reactivation): The adversarial prompts used to reactivate gender associations after SFT must be shown to remain within the neutral category or accompanied by controls demonstrating that reactivation reflects pre-existing encodings rather than prompt-induced amplification; this directly affects the interpretation that alignment leaves encodings unchanged.

    Authors: We acknowledge the importance of ruling out prompt-induced amplification for the reactivation interpretation. The adversarial prompts were constructed to preserve semantic neutrality while providing minimal triggering cues, as confirmed by pilot evaluations. In revision, we will add explicit controls in §5.2: applying identical adversarial prompts to an SFT model trained on non-gender data and reporting that no comparable gender bias reactivation occurs, thereby demonstrating that the effect relies on pre-existing encodings rather than the prompts alone. Quantitative results and prompt examples will be included. revision: yes

  3. Referee: [§4] Results tables (e.g., correlation tables in §4): The reported consistent associations should include direct head-to-head comparisons with the specific methods and prompt sets from prior studies that found weak correlations, along with effect sizes and robustness across multiple random seeds, to substantiate the discrepancy.

    Authors: We agree that head-to-head comparisons would better substantiate the discrepancy with prior inconsistent findings. We will add a new subsection and table in §4 that directly compares our unified-protocol correlations against those obtained by re-implementing key prior methods and prompt sets on the same model suite, reporting Pearson r values, effect sizes, and standard deviations across three random seeds. This analysis will highlight how the identical neutral-prompt design yields more stable associations. revision: yes
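
The label-permutation control named in the first response can be sketched as below. This is a simplified illustration, not the authors' procedure: it assumes each prompt has already been reduced to a single probe score, and it uses the gap between group means as the test statistic instead of the accuracy of a logistic-regression probe. The scores, labels, and function names are hypothetical.

```python
import random
from statistics import mean

def mean_gap(scores, labels):
    """Absolute difference between the mean scores of the two label groups."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return abs(mean(pos) - mean(neg))

def permutation_p_value(scores, labels, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels match or beat the observed gap."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean_gap(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between score and label
        if mean_gap(scores, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never zero.
    return (hits + 1) / (n_permutations + 1)

# Made-up probe scores: the first four prompts carry label 1, the rest label 0.
probe_scores = [0.9, 0.8, 0.7, 0.85, 0.1, 0.2, 0.15, 0.3]
probe_labels = [1, 1, 1, 1, 0, 0, 0, 0]

p_value = permutation_p_value(probe_scores, probe_labels)
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value indicates that the separation between groups is unlikely to arise from the prompts alone. Swapping `mean_gap` for cross-validated probe accuracy would bring the sketch closer to the control the response describes.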

Circularity Check

0 steps flagged

Empirical measurement study with self-contained protocol

full rationale

The paper introduces a new unified protocol that applies identical neutral prompts to measure both latent gender information in internal representations and expressed bias in outputs. Central claims rest on direct empirical observations of associations and the effects of supervised fine-tuning, without any derivations, parameter fits renamed as predictions, or load-bearing self-citations. No step reduces by construction to its own inputs; results are presented as measurements under the stated protocol and are falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the validity of neutral-prompt elicitation for latent representations and the assumption that supervised fine-tuning affects outputs differently from internal encodings; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Neutral prompts can reliably surface gender-related information from internal representations without themselves introducing or masking bias.
    This underpins the direct comparison between encoded and expressed bias in the unified framework.

pith-pipeline@v0.9.0 · 5512 in / 1220 out tokens · 28796 ms · 2026-05-15T00:36:37.616118+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

  1. Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., Nanda, N.: Refusal in language models is mediated by a single direction. In: NeurIPS (2024)
  2. Bolukbasi, T., Chang, K.W., Zou, J.Y., Saligrama, V., Kalai, A.T.: Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In: NeurIPS (2016)
  3. Caliskan, A., Bryson, J.J., Narayanan, A.: Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017)
  4. Cao, Y.T., Pruksachatkun, Y., Chang, K.W., Gupta, R., Kumar, V., Dhamala, J., Galstyan, A.: On the intrinsic and extrinsic fairness evaluation metrics for contextualized language representations. In: ACL (2022)
  5. Chalabaev, A., Sarrazin, P., Fontayne, P., Boiché, J., Clément-Guillotin, C.: The influence of sex stereotypes and gender roles on participation and performance in sport and exercise: Review and future directions. Psychology of sport and exercise 14, 136–144 (2013)
  6. Chen, H., Vondrick, C., Mao, C.: Selfie: self-interpretation of large language model embeddings. In: ICML (2024)
  7. Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.W., Gupta, R.: Bold: Dataset and metrics for measuring biases in open-ended language generation. In: ACM Conf. on FAccT (2021)
  8. Ducel, F., Hiebel, N., Ferret, O., Fort, K., Névéol, A.: “Women do not have heart attacks!” Gender Biases in Automatically Generated Clinical Cases in French. In: Findings NAACL (2025)
  9. Gadassi, R., Gati, I.: The effect of gender stereotypes on explicit and implicit career preferences. The Counseling Psychologist 37, 902–922 (2009)
  10. Gallegos, I.O., Rossi, R.A., Barrow, J., Tanjim, M.M., Kim, S., Dernoncourt, F., Yu, T., Zhang, R., Ahmed, N.K.: Bias and fairness in large language models: A survey. Computational Linguistics 50, 1097–1179 (2024)
  11. Goldfarb-Tarrant, S., Marchant, R., Sánchez, R.M., Pandya, M., Lopez, A.: Intrinsic bias metrics do not correlate with application bias. In: ACL-IJCNLP (2021)
  12. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., et al.: The Llama 3 Herd of Models. In: arXiv (2024)
  13. Greenwald, A.G., McGhee, D.E., Schwartz, J.L.: Measuring individual differences in implicit cognition: the implicit association test. Journal of personality and social psychology 74, 1464 (1998)
  14. Guo, W., Caliskan, A.: Detecting emergent intersectional biases: Contextualized word embeddings contain a distribution of human-like biases. In: AIES (2021)
  15. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring Massive Multitask Language Understanding. ICLR (2021)
  16. Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-Rank Adaptation of Large Language Models. In: ICLR (2022)
  17. Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., et al.: Mistral 7B. In: arXiv (2023)
  18. Jiang, Z., Xu, F.F., Araki, J., Neubig, G.: How can we know what language models know? TACL (2020)
  19. Jonauskaite, D., Dael, N., Chèvre, L., Althaus, B., Tremea, A., Charalambides, L., Mohr, C.: Pink for girls, red for boys, and blue for both genders: Colour preferences in children and adults. Sex Roles 80, 630–642 (2019)
  20. Lin, X., Li, L.: Implicit bias in LLMs: A survey. arXiv (2025)
  21. Lum, K., Anthis, J.R., Robinson, K., Nagpal, C., D’Amour, A.N.: Bias in language models: Beyond trick tests and towards RUTEd evaluation. In: ACL (2025)
  22. May, C., Wang, A., Bordia, S., Bowman, S., Rudinger, R.: On measuring social biases in sentence encoders. In: NAACL (2019)
  23. Nadeem, M., Bethke, A., Reddy, S.: StereoSet: Measuring stereotypical bias in pretrained language models. In: ACL-IJCNLP (2021)
  24. Nangia, N., Vania, C., Bhalerao, R., Bowman, S.R.: CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In: EMNLP (2020)
  25. Pan, W., Liu, Z., Chen, Q., Zhou, X., Haining, Y., Jia, X.: The hidden dimensions of LLM alignment: A multi-dimensional analysis of orthogonal safety directions. In: ICML (2025)
  26. Park, K., Choe, Y.J., Jiang, Y., Veitch, V.: The Geometry of Categorical and Hierarchical Concepts in Large Language Models. In: ICLR (2025)
  27. Park, K., Choe, Y.J., Veitch, V.: The Linear Representation Hypothesis and the Geometry of Large Language Models. In: ICML (2024)
  28. Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P.M., Bowman, S.: BBQ: A hand-built bias benchmark for question answering. In: Findings ACL (2022)
  29. Rooein, D., Zouhar, V., Nozza, D., Hovy, D.: Biased tales: Cultural and topic bias in generating children’s stories. In: EMNLP (2025)
  30. Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., et al.: Gemma: Open Models Based on Gemini Research and Technology. In: arXiv (2024)
  31. Wan, Y., Pu, G., Sun, J., Garimella, A., Chang, K.W., Peng, N.: “Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters. In: Findings EMNLP (2023)
  32. Webster, K., Wang, X., Tenney, I., Beutel, A., Pitler, E., Pavlick, E., Chen, J., Chi, E., Petrov, S.: Measuring and reducing gendered correlations in pre-trained models. arXiv (2020)
  33. Zhang, T., Zeng, Z., YuxiangXiao, Y., Zhuang, H., Chen, C., Foulds, J.R., Pan, S.: GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models. In: ACL (2025)
  34. Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.W.: Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In: NAACL (2018)
  35. Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., Hou, L.: Instruction-Following Evaluation for Large Language Models. arXiv (2023)
  36. Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv (2023)

A Reproducibility

Code will be made publicly available upon acceptance of the paper.

A.1 Models and fine-tuning

Models. The models used (Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.1, and gemma-7b...