Considerations for the Interpretation of Bias Measures of Word Embeddings

Anthony Schulte; Inom Mirzaev; Michael Conover; Sam Shah

arxiv: 1906.08379 · v1 · pith:GIGPHQYInew · submitted 2019-06-19 · 💻 cs.CL

Pith reviewed 2026-05-25 20:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords: word embeddings, bias metrics, hyper-parameter sensitivity, corpus bias, Bolukbasi metric, embedding training, societal bias, variance analysis

The pith

The bias metric for word embeddings varies more with hyper-parameter choices than with the underlying corpus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how a standard bias metric for word embeddings responds to different training settings and data sources. It shows that varying hyper-parameters in the embedding algorithm often produces larger swings in the bias score than switching between different text corpora. This pattern holds across many cases examined, though corpus rankings stay more stable in some situations. The finding implies that the metric tracks features of the trained embedding space at least as much as properties of the original text.

Core claim

The bias metric proposed by Bolukbasi et al. 2016 is highly sensitive to embedding hyper-parameter selection, and in many cases the variance due to hyper-parameter selection exceeds the variance due to corpus selection, while in fewer cases the bias rankings of corpora vary with hyper-parameter selection. Bias estimates should therefore be understood as measuring properties of the specific embedding spaces rather than directly the underlying corpus, and comparisons of bias metrics across spaces generated with differing hyper-parameters should account for the embedding-learning algorithm configurations.

What carries the argument

The Bolukbasi et al. 2016 bias metric, which measures directional associations between sets of word vectors to quantify bias.
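As a rough illustration of what such a directional measure computes, the sketch below follows the direct-bias form usually attributed to Bolukbasi et al. (2016): the mean of |cos(w, g)|^c over a set of nominally neutral word vectors w, where g is a bias direction and c is a strictness exponent. The helper names `cosine` and `direct_bias`, the toy 2-d vectors, and the hand-picked direction `g` are assumptions made for illustration; in the original method g is derived from definitional word pairs, which this sketch does not attempt.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def direct_bias(neutral_vectors, bias_direction, c=1.0):
    """Mean of |cos(w, g)|**c over the neutral word vectors."""
    total = sum(abs(cosine(w, bias_direction)) ** c for w in neutral_vectors)
    return total / len(neutral_vectors)

# Toy 2-d vectors, invented for illustration only (not data from the paper).
g = [1.0, 0.0]  # hypothetical bias direction
neutral = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = direct_bias(neutral, g, c=1.0)
```

Because the score is an average of cosine magnitudes, anything that changes the geometry of the trained space, such as its dimensionality, can move it even when the corpus is held fixed.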

If this is right

Bias estimates reflect the embedding space configuration more than the corpus alone.
Direct comparisons of bias metrics across embeddings with different hyper-parameters may be misleading without accounting for training settings.
The metric's utility for quantifying corpus biases is limited when hyper-parameters vary.
Researchers should report hyper-parameter details when presenting bias measurements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

New bias metrics that remain stable across hyper-parameter choices could enable cleaner comparisons of corpus bias.
Downstream NLP systems may exhibit different bias levels depending on which hyper-parameter settings are used during embedding training.
Bias mitigation methods evaluated with this metric should include checks for sensitivity to the training configuration.

Load-bearing premise

The experiments cover representative hyper-parameter ranges and corpora such that the variance comparison generalizes to typical use cases.

What would settle it

A study that measures the same bias metric across many corpora while sweeping the same hyper-parameter ranges and finds that corpus variance consistently exceeds hyper-parameter variance would falsify the main sensitivity claim.
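One simple way to frame such a comparison is sketched below, assuming bias scores are stored per corpus and per hyper-parameter configuration. It contrasts the average variance across configurations (corpus held fixed) with the average variance across corpora (configuration held fixed). The corpus names, configuration labels, and every number are invented toy values, and this is not necessarily the decomposition the paper uses.

```python
from statistics import mean, pvariance

# scores[corpus][config] -> bias score; all numbers are invented toy values.
scores = {
    "corpus_a": {"dim50": 0.10, "dim300": 0.04},
    "corpus_b": {"dim50": 0.12, "dim300": 0.05},
}

def variance_across_configs(scores):
    """Average, over corpora, of the variance across hyper-parameter configs."""
    return mean([pvariance(list(by_cfg.values())) for by_cfg in scores.values()])

def variance_across_corpora(scores):
    """Average, over configs, of the variance across corpora."""
    configs = list(next(iter(scores.values())).keys())
    return mean([pvariance([scores[c][cfg] for c in scores]) for cfg in configs])

hp_var = variance_across_configs(scores)
corpus_var = variance_across_corpora(scores)
hyperparams_dominate = hp_var > corpus_var
```

In this toy table the hyper-parameter variance is the larger of the two; a result that settled the question the other way would show `corpus_var` exceeding `hp_var` consistently across realistic grids.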

Figures

Figures reproduced from arXiv: 1906.08379 by Anthony Schulte, Inom Mirzaev, Michael Conover, Sam Shah.

**Figure 2.** Kendall tau rank correlation coefficients for pairs of word embeddings trained

**Figure 3.** Bias score decay with increasing dimension

**Figure 4.** Average magnitude of cosine similarity between 1000 randomly sampled word pairs

**Figure 5.** Distributions of direct bias scores of various online corpora under bootstrapping
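For readers unfamiliar with the statistic named in the Figure 2 caption, here is a minimal sketch of a Kendall tau rank correlation between two lists of scores. The caption is truncated, so which items the paper ranks is not stated here; the example assumes bias scores for the same four hypothetical corpora under two hypothetical configurations. The function implements the tie-free tau-a variant, and all values are invented.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a for two equal-length score lists (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Invented bias scores for the same four corpora under two configurations.
config_1 = [0.10, 0.12, 0.08, 0.15]
config_2 = [0.04, 0.05, 0.06, 0.07]
tau = kendall_tau(config_1, config_2)
```

A value near 1 means the two configurations order the items almost identically, while values near 0 or below indicate that the ordering changes with the configuration.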

Original abstract

Word embedding spaces are powerful tools for capturing latent semantic relationships between terms in corpora, and have become widely popular for building state-of-the-art natural language processing algorithms. However, studies have shown that societal biases present in text corpora may be incorporated into the word embedding spaces learned from them. Thus, there is an ethical concern that human-like biases contained in the corpora and their derived embedding spaces might be propagated, or even amplified with the usage of the biased embedding spaces in downstream applications. In an attempt to quantify these biases so that they may be better understood and studied, several bias metrics have been proposed. We explore the statistical properties of these proposed measures in the context of their cited applications as well as their supposed utilities. We find that there are caveats to the simple interpretation of these metrics as proposed. We find that the bias metric proposed by Bolukbasi et al. 2016 is highly sensitive to embedding hyper-parameter selection, and that in many cases, the variance due to the selection of some hyper-parameters is greater than the variance in the metric due to corpus selection, while in fewer cases the bias rankings of corpora vary with hyper-parameter selection. In light of these observations, it may be the case that bias estimates should not be thought to directly measure the properties of the underlying corpus, but rather the properties of the specific embedding spaces in question, particularly in the context of hyper-parameter selections used to generate them. Hence, bias metrics of spaces generated with differing hyper-parameters should be compared only with explicit consideration of the embedding-learning algorithm's particular configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Bolukbasi metric varies more with some hyper-parameter choices than with corpus in the tested cases, but the paper needs to show the actual ranges and stats to make the interpretation claim stick.

The main thing to know is that this paper reports the Bolukbasi et al. 2016 bias metric can show larger variance from embedding hyper-parameter choices than from switching corpora, and that corpus rankings sometimes flip with those choices. They conclude the metric should be read as describing the specific embedding space rather than the corpus alone. That observation is new enough as an empirical note on an existing metric, and it is a reasonable caution for people running bias audits on embeddings. The work is straightforward in pointing out that the metric is not corpus-invariant under different training setups. Credit for running the comparisons and surfacing the sensitivity.

The soft spot is exactly the one in the stress-test note: without the actual hyper-parameter grid, window sizes, dimensions, and how they were sampled, it is difficult to tell whether the hyper-parameter variance is inflated by including settings no one uses in practice. The abstract gives no numbers on ranges or statistical tests, so the claim that hyper-parameter effects often dominate corpus effects rests on unshown details. If the grid was narrow and realistic, the result is more useful; if it was broad and extreme, the interpretation advice weakens.

This is for researchers who apply or extend bias metrics in embeddings and fairness work. It is worth sending to peer review so the experimental setup can be checked and the variance numbers can be reproduced or challenged. The thinking is clear and the concern is practical even if the evidence needs tightening.

Referee Report

1 major / 1 minor

Summary. The paper examines statistical properties of bias metrics for word embeddings, with focus on the Bolukbasi et al. (2016) metric. It reports that this metric is highly sensitive to embedding hyper-parameter selection, such that in many cases the variance attributable to hyper-parameter choices exceeds the variance due to corpus selection, and that in some cases the bias rankings of corpora change with hyper-parameter selection. The authors conclude that bias estimates should be interpreted as properties of the specific embedding spaces (accounting for hyper-parameter configurations) rather than direct measures of the underlying corpora.

Significance. If the empirical variance comparisons hold under representative conditions, the work is significant for NLP bias research: it supplies concrete evidence against treating bias metrics as corpus-intrinsic quantities and supports more cautious experimental reporting when comparing embeddings. The direct variance decomposition is a strength of the approach.

major comments (1)

[hyper-parameter experiments] The central claim (that hyper-parameter-induced variance often exceeds corpus-induced variance and that rankings can change) is load-bearing for the interpretation recommendation in the abstract. However, the manuscript must demonstrate that the tested hyper-parameter grid (dimensions, window sizes, etc.) reflects settings actually used in the literature; if extreme or atypical values are included, the variance comparison does not necessarily generalize to standard configurations, undermining the advice that bias metrics should not be read as corpus properties.

minor comments (1)

[Abstract] The abstract is clear on the main findings but would benefit from a brief parenthetical note on the specific corpora and hyper-parameter ranges examined, to allow immediate assessment of scope.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. The major comment raises an important point about generalizability of the hyper-parameter results, which we address below with a commitment to strengthen the manuscript.

Referee: [hyper-parameter experiments] The central claim (that hyper-parameter-induced variance often exceeds corpus-induced variance and that rankings can change) is load-bearing for the interpretation recommendation in the abstract. However, the manuscript must demonstrate that the tested hyper-parameter grid (dimensions, window sizes, etc.) reflects settings actually used in the literature; if extreme or atypical values are included, the variance comparison does not necessarily generalize to standard configurations, undermining the advice that bias metrics should not be read as corpus properties.

Authors: We agree that explicit comparison to literature-standard configurations is necessary to support the claim's generalizability. Our grid was chosen to include widely reported values (e.g., embedding dimensions 50/100/300/500, window sizes 2/5/10, negative samples 5/10) drawn from Mikolov et al. (2013), Pennington et al. (2014), and bias-evaluation papers; however, the manuscript does not currently include a side-by-side table. We will add such a table (and restrict the primary variance analysis to the overlapping subset of standard settings) to demonstrate that the reported variance dominance and ranking instability persist even under representative configurations. This revision directly addresses the concern while preserving the core empirical findings.

revision: yes
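To make the size of such a sweep concrete, the following sketch enumerates the full cross product of the example values quoted in the simulated rebuttal. The dictionary keys (`dimension`, `window`, `negative_samples`) are hypothetical labels rather than parameters of any particular library, and whether the paper trained every combination is not stated.

```python
from itertools import product

# Values quoted in the simulated rebuttal; key names are hypothetical labels.
grid = {
    "dimension": [50, 100, 300, 500],
    "window": [2, 5, 10],
    "negative_samples": [5, 10],
}

# One dict per configuration, covering every combination of the values above.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```

Storing each bias score alongside the configuration dict that produced it would make a side-by-side table of the kind promised in the rebuttal straightforward to build.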

Circularity Check

0 steps flagged

No significant circularity; empirical variance comparisons are independent of inputs

The paper's claims rest on direct empirical computation of the Bolukbasi bias metric across varied hyper-parameter grids and multiple corpora, followed by variance decomposition. These steps involve no self-definitional equations, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems. The derivation chain consists of standard statistical comparisons on externally generated embeddings and is self-contained against the observed data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No new parameters, axioms, or entities introduced; the work is an empirical analysis of existing bias metrics.

Reference graph

Works this paper leans on

[1] Antoniak, Maria and David Mimno (2018). “Evaluating the Stability of Embedding-Based Word Similarities”. In: Transactions of the Association for Computational Linguistics 6, pp. 107–119. issn: 2307-387X.

[2] Bolukbasi, Tolga et al. (2016). “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings”. In: Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee et al. Curran Associates, Inc., pp. 4349–4357.

[3] Brunet, Marc-Etienne et al. (2018). “Understanding the Origins of Bias in Word Embeddings”. In: arXiv: 1810.03611 [cs, stat].

[4] Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan (2017). “Semantics Derived Automatically from Language Corpora Contain Human-like Biases”. In: Science 356.6334, pp. 183–186. issn: 0036-8075, 1095-9203.

[5] Garg, Nikhil et al. (2018). “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes”. In: Proceedings of the National Academy of Sciences 115.16, E3635–E3644. issn: 0027-8424, 1091-6490.

[6] Mikolov, Tomas et al. (2013). “Efficient Estimation of Word Representations in Vector Space”. In: arXiv: 1301.3781 [cs].

[7] Pennington, Jeffrey, Richard Socher, and Christopher Manning (2014). “Glove: Global Vectors for Word Representation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, pp. 1532–1543.
