Token-Level Entropy Reveals Demographic Disparities in Language Models
Pith reviewed 2026-05-23 04:19 UTC · model grok-4.3
The pith
Black-associated names produce higher first-token entropy than White-associated names across six language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across six open-weight base models and 5,760 implicit prompts, Black-associated names generate higher first-token entropy than White-associated names in every architecture and produce larger entropy increases over neutral baselines. Women-associated names produce lower entropy and more homogeneous continuations than men-associated names, with race and gender effects additive. Instruction tuning does not shrink the race gap. Explicit demographic labels instead of names yield null race effects in ten of twelve models where the name-based probe is significant.
What carries the argument
First-token Shannon entropy measured at temperature zero over the full vocabulary in implicit sentence-completion prompts that vary only the subject name.
If this is right
- Race and gender effects on token entropy are independent and additive.
- Instruction tuning leaves the race-related entropy gap unchanged.
- Explicit group labels recover different distributional structure than implicit name prompts.
- Gender patterns in entropy align with output homogeneity bias while race patterns run opposite to it.
- Probing method determines which demographic disparities appear in the generative distribution.
Where Pith is reading between the lines
- If the entropy gap tracks training-data co-occurrence statistics, frequency-matched name lists would be expected to shrink or eliminate it.
- Token-level entropy may surface distributional effects that final sampled outputs or direct demographic prompts conceal.
- The additive race-gender pattern implies separate pathways by which each identity signal modulates next-token uncertainty.
Load-bearing premise
Names function as clean, isolated signals of race and gender without other associations such as frequency or cultural context affecting model behavior.
What would settle it
Re-running the exact templates and models with a fresh set of names matched on frequency, length, and common associations but still labeled by demographic category, and finding no entropy difference by group.
read the original abstract
We ask whether demographic identity, signaled by a name alone, systematically reshapes the generative distribution of a language model. Measuring full-vocabulary Shannon entropy at temperature zero across six open-weight base models and 5,760 implicit sentence-completion prompts (e.g., "Tanisha walked into the office on a Monday morning and"), we find that Black-associated names produce higher first-token entropy than White-associated names across all six architectures - opposite to the output-level homogeneity bias documented under explicit demographic prompting (Lee et al., 2024) - and Black-associated names always produce greater entropy above identity-neutral baselines than White-associated names ($\Delta\Delta > 0$ in all six models). Women-associated names co-occur with lower first-token entropy (DL-pooled $\hat\beta = -0.041, p = .019$) and more homogeneous outputs ($\hat\alpha = +0.024, p < .001$) than men-associated names - a pattern convergent with homogeneity bias; race and gender effects are additive. Instruction tuning does not attenuate the race gap (matched-format DL-pooled $\hat{\beta}=+0.153$). Running the same templates with explicit group labels instead of names yields null race effects in 10 of 12 models where implicit probing is significant - establishing that probing methodology is a primary determinant of which distributional structure is recovered.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that implicit demographic signals via names in 5,760 sentence-completion prompts produce systematic shifts in first-token Shannon entropy (at temperature zero) across six open-weight models: Black-associated names yield higher entropy than White-associated names (with ΔΔ > 0 in all models), opposite to explicit-prompt homogeneity bias; female-associated names yield lower entropy (DL-pooled β̂ = −0.041, p = .019) and more homogeneous outputs (α̂ = +0.024, p < .001); race and gender effects are additive; instruction tuning does not attenuate the race gap (matched-format DL-pooled β̂ = +0.153); explicit group labels produce null race effects in 10 of 12 cases where implicit effects are significant.
Significance. If the name lists function as clean demographic proxies, the finding that implicit name probing recovers directional entropy disparities (including ΔΔ > 0) that explicit labels do not would be significant for bias measurement methodology, showing that generative distributions encode demographic structure differently depending on cue type and that instruction tuning leaves the race gap intact.
major comments (3)
- [Abstract] Abstract and prompt-construction description: the central claim that Black-associated names produce reliably higher first-token entropy (and ΔΔ > 0 across all six models) treats the chosen names as isolated demographic signals, yet no details are supplied on name selection criteria, frequency matching, or controls for cultural/socioeconomic associations in the training data; this assumption is load-bearing for both the directionality and the implicit-vs-explicit contrast.
- [Abstract] Statistical reporting (DL-pooled regressions): the reported β̂ and α̂ coefficients with p-values are used to support additive race/gender effects and the non-attenuation claim, but the exact regression specification, model for pooling across architectures, multiple-comparison correction, and error-bar construction are not provided, preventing assessment of whether the p = .019 and p < .001 results are robust.
- [Abstract] Explicit-label comparison: the methodology-dependence conclusion rests on null race effects in 10 of 12 models under explicit labels, but without the per-model breakdown, exact prompt templates, or exclusion rules, the contrast with the implicit results cannot be evaluated for robustness.
minor comments (2)
- [Abstract] The six open-weight base models are not named in the abstract, which would improve reproducibility.
- Clarify whether first-token entropy is computed over the full vocabulary or a restricted set, and how temperature-zero sampling is implemented for entropy calculation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below with clarifications from the manuscript and indicate revisions where additional details will strengthen the presentation. All comments can be addressed through expanded methods and supplementary material.
read point-by-point responses
-
Referee: [Abstract] Abstract and prompt-construction description: the central claim that Black-associated names produce reliably higher first-token entropy (and ΔΔ > 0 across all six models) treats the chosen names as isolated demographic signals, yet no details are supplied on name selection criteria, frequency matching, or controls for cultural/socioeconomic associations in the training data; this assumption is load-bearing for both the directionality and the implicit-vs-explicit contrast.
Authors: The name lists were drawn from established sources in the demographic bias literature (e.g., validated racial associations from prior NLP studies) and frequency-matched where possible using U.S. Census data on name prevalence. We acknowledge that the abstract omits these criteria and that full controls for all cultural or socioeconomic associations in training data are not feasible. The implicit-vs-explicit contrast remains informative even with these caveats, as the same names are used in both conditions. We will add a dedicated methods subsection detailing selection criteria, frequency matching, and a limitations discussion of residual confounds. revision: yes
-
Referee: [Abstract] Statistical reporting (DL-pooled regressions): the reported β̂ and α̂ coefficients with p-values are used to support additive race/gender effects and the non-attenuation claim, but the exact regression specification, model for pooling across architectures, multiple-comparison correction, and error-bar construction are not provided, preventing assessment of whether the p = .019 and p < .001 results are robust.
Authors: The DL-pooled estimates come from linear mixed-effects models with architecture as a random effect (detailed in the methods and supplementary materials). Pooling follows a standard meta-analytic approach across the six models. No multiple-comparison correction was applied because the primary tests were pre-specified; error bars reflect model standard errors. We agree the abstract is too terse and will move the full specification, pooling procedure, and robustness checks into the main text. revision: yes
-
Referee: [Abstract] Explicit-label comparison: the methodology-dependence conclusion rests on null race effects in 10 of 12 models under explicit labels, but without the per-model breakdown, exact prompt templates, or exclusion rules, the contrast with the implicit results cannot be evaluated for robustness.
Authors: The per-model results for explicit labels are reported in the supplementary tables; the 10-of-12 null count aggregates those. Prompt templates for the explicit condition are the same sentence frames with group labels substituted for names. No prompts were excluded beyond basic length and formatting filters. We will add a main-text table with the per-model explicit results, reproduce the exact templates in the methods, and clarify the exclusion criteria to allow direct evaluation of the contrast. revision: yes
Circularity Check
No circularity; direct empirical measurements only
full rationale
The paper computes Shannon entropy directly from model next-token distributions on fixed prompt templates and fits ordinary regressions to those entropy values. No equations define a target quantity in terms of a fitted parameter from the same data, no predictions are generated from a subset fit, and the single self-citation (Lee et al. 2024) is used solely for directional contrast rather than to establish uniqueness or an ansatz. All reported statistics (ΔΔ, β̂, α̂) are computed from the observed token probabilities and are therefore independent of any prior fitted values within this work.
Axiom & Free-Parameter Ledger
free parameters (1)
- DL-pooled beta coefficients
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.