Using Embedding Models to Improve Probabilistic Race Prediction

Kosuke Imai; Noah Dasanaike

arXiv: 2604.22555 · v2 · submitted 2026-04-24 · 💻 cs.CL

Pith reviewed 2026-05-08 11:35 UTC · model grok-4.3

keywords: race prediction, BISG, name embeddings, probabilistic imputation, voter files, racial disparities, neural networks, uncommon surnames

The pith

Embedding models improve race probability estimates for people with uncommon surnames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard Bayesian Improved Surname Geocoding (BISG) relies on Census surname lists that cover only common names, and it falls back to an uninformative generic prior for the roughly 10 percent of the US population whose surnames are omitted. The paper introduces embedding-powered BISG (eBISG), which converts names into dense vectors using pre-trained text embeddings and trains neural networks on 2020 Census name data plus Southern-state voter files to produce race probabilities for those omitted names. Successive versions that add first-name embeddings and then full-name embeddings each raise predictive accuracy, with the largest gains for Hispanic and Asian individuals whose surnames are absent from the Census list. Better estimates matter because many studies of racial disparities in voting, health, or lending depend on imputing race at the individual level from available records.

Core claim

The authors show that the eBISG approaches outperform standard BISG for uncommon surnames, and that the full-name embedding model trained on voter file data yields the largest gains in race prediction accuracy. It does so by capturing interactions between name components that the surname-only and surname-plus-first-name methods, which treat each component separately, miss.

What carries the argument

The full-name embedding model, which turns complete names into dense vectors from pre-trained embeddings and trains a neural network on Southern voter files to output race probabilities.
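
The mechanism described above (name to dense vector, then a neural network that outputs class probabilities) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `embed_name` is a hypothetical hashed-trigram stand-in for a real pre-trained text embedding, the one-hidden-layer architecture and all hyperparameters are invented for the example, and the training names and integer class labels are synthetic placeholders.

```python
import zlib
import numpy as np

DIM = 64        # assumed embedding size (illustrative only)
N_CLASSES = 5   # assumed number of output categories


def embed_name(full_name: str, dim: int = DIM) -> np.ndarray:
    """Hypothetical stand-in for a pre-trained embedding: hashed character trigrams."""
    text = f"  {full_name.lower()}  "
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(text[i:i + 3].encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train(X, y, hidden=32, lr=0.5, epochs=200, seed=0):
    """Fit a one-hidden-layer network with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.1, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, N_CLASSES))
    b2 = np.zeros(N_CLASSES)
    Y = np.eye(N_CLASSES)[y]  # one-hot targets
    losses = []
    for _ in range(epochs):
        H = np.maximum(0, X @ W1 + b1)   # ReLU hidden layer
        P = softmax(H @ W2 + b2)         # class probabilities
        losses.append(-np.mean(np.log(P[np.arange(len(y)), y] + 1e-12)))
        dZ2 = (P - Y) / len(y)           # gradient of cross-entropy w.r.t. logits
        dZ1 = (dZ2 @ W2.T) * (H > 0)     # backpropagate through ReLU
        W2 -= lr * (H.T @ dZ2)
        b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * (X.T @ dZ1)
        b1 -= lr * dZ1.sum(axis=0)
    return (W1, b1, W2, b2), losses


def predict_proba(params, X):
    W1, b1, W2, b2 = params
    return softmax(np.maximum(0, X @ W1 + b1) @ W2 + b2)


# Synthetic placeholder names with arbitrary class indices (no real data).
names = ["alpha one", "alpha two", "alpha three", "beta one", "beta two", "beta three"]
y = np.array([0, 0, 0, 1, 1, 1])
X = np.stack([embed_name(n) for n in names])
params, losses = train(X, y)

# Names never seen in training still receive a full probability vector.
probs = predict_proba(params, np.stack([embed_name("alpha four"), embed_name("gamma nine")]))
```

The point of the sketch is only that an embedding gives every string a vector, so a name missing from any lookup table still gets an informed probability estimate rather than a generic fallback.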

If this is right

Standard BISG performance drops for the 10 percent of the population with uncommon surnames because it falls back to a generic prior.
Adding surname embeddings alone improves predictions for omitted names over the baseline.
Combining surname and first-name embeddings produces further gains by using more name information.
The full-name embedding captures name-component interactions and delivers the biggest accuracy lift, especially for Hispanic and Asian groups.
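
The fallback behaviour in the first point can be illustrated with a minimal sketch. It assumes the commonly used BISG factorisation, in which the posterior is proportional to P(race | surname) times P(location | race). The surname table, the generic prior, the location likelihoods, and the function name `bisg_posterior` are all hypothetical values chosen for illustration; none are taken from the paper or from Census data.

```python
from typing import Dict

RACES = ("white", "black", "hispanic", "asian", "other")

# Hypothetical surname table: only "listed" surnames have their own prior.
SURNAME_TABLE: Dict[str, Dict[str, float]] = {
    "LISTEDNAME": {"white": 0.05, "black": 0.01, "hispanic": 0.92, "asian": 0.01, "other": 0.01},
}

# Hypothetical generic prior shared by every unlisted surname.
GENERIC_PRIOR: Dict[str, float] = {
    "white": 0.60, "black": 0.12, "hispanic": 0.18, "asian": 0.06, "other": 0.04,
}


def bisg_posterior(surname: str, geo_given_race: Dict[str, float]) -> Dict[str, float]:
    """Combine a surname prior with P(location | race) and normalise."""
    # Unlisted surnames fall back to the same generic prior.
    prior = SURNAME_TABLE.get(surname.upper(), GENERIC_PRIOR)
    unnormalised = {r: prior[r] * geo_given_race[r] for r in RACES}
    total = sum(unnormalised.values())
    return {r: v / total for r, v in unnormalised.items()}


# Hypothetical P(location | race) for one geographic unit.
geo = {"white": 0.002, "black": 0.001, "hispanic": 0.001, "asian": 0.003, "other": 0.002}

listed = bisg_posterior("Listedname", geo)
unlisted_a = bisg_posterior("Rarename", geo)
unlisted_b = bisg_posterior("Otherrarename", geo)
```

In this sketch `unlisted_a` and `unlisted_b` are identical: once the surname is missing from the table, the name contributes nothing and only geography differentiates individuals. The eBISG variants replace that shared fallback with a name-specific estimate.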

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same embedding technique could be adapted to impute other protected attributes from names or addresses in administrative datasets.
Regional differences in name-race patterns might cause the Southern-trained models to underperform in states with different demographics.
Adopting eBISG in disparity studies would allow researchers to retain more observations instead of dropping records with missing surnames.

Load-bearing premise

Neural networks trained on Southern-state voter files and Census data will generalize to produce accurate race probabilities for uncommon names across the full US population without introducing new biases.

What would settle it

A direct accuracy comparison of eBISG versus standard BISG on a large national sample of individuals with uncommon surnames and verified race labels would falsify the central claim if the embedding models do not show higher precision or recall.
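
As a rough illustration of the comparison described above, precision and recall for one group can be computed by thresholding that group's predicted probability against verified labels. The function names and the toy scores below are hypothetical; the sketch does not reproduce the paper's evaluation code or data.

```python
from typing import List, Tuple


def precision_recall(scores: List[float], is_member: List[bool], threshold: float) -> Tuple[float, float]:
    """Precision and recall for one group at a given probability threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, m in zip(predicted, is_member) if p and m)
    fp = sum(1 for p, m in zip(predicted, is_member) if p and not m)
    fn = sum(1 for p, m in zip(predicted, is_member) if not p and m)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pr_curve(scores: List[float], is_member: List[bool], thresholds: List[float]):
    """Sweep thresholds to trace out a precision-recall curve."""
    return [(t, *precision_recall(scores, is_member, t)) for t in thresholds]


# Toy predicted probabilities for one group, with verified membership labels.
scores = [0.9, 0.8, 0.4, 0.2]
is_member = [True, False, True, False]
curve = pr_curve(scores, is_member, [0.3, 0.5, 0.85])
```

Running the same sweep on BISG and eBISG probabilities for the same verified sample would give two curves per group that can be compared directly.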

Figures

Figures reproduced from arXiv: 2604.22555 by Kosuke Imai, Noah Dasanaike.

**Figure 1.** Precision-recall curves comparing voters whose surnames appear in the Census ...

**Figure 2.** Precision-recall curves for BISG and eBISG race prediction among voters with ...

**Figure 3.** Brier score for individual-level race prediction among voters with unmatched ...

**Figure 4.** Calibration of BISG race probabilities for all voters. Points are sized by the number ...
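
For readers unfamiliar with the metrics named in Figures 3 and 4, the sketch below shows one common way to compute a multi-class Brier score and a binned calibration table. The exact definitions used in the paper are not given in this summary, so the formulas, function names, and toy inputs here are assumptions.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean over individuals of the squared distance between the predicted
    probability vector and the one-hot true label (lower is better)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)


def calibration_bins(scores: Sequence[float], is_member: Sequence[bool],
                     n_bins: int = 10) -> List[Tuple[float, float, int]]:
    """Group predictions for one class into equal-width bins and return
    (mean predicted probability, observed share, count) per non-empty bin."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for s, m in zip(scores, is_member):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, m))
    rows = []
    for b in bins:
        if b:
            mean_pred = sum(s for s, _ in b) / len(b)
            observed = sum(1 for _, m in b if m) / len(b)
            rows.append((mean_pred, observed, len(b)))
    return rows


# A uniform five-class prediction scores 0.8; a perfect prediction scores 0.
uniform = brier_score([[0.2] * 5], [0])
perfect = brier_score([[1.0, 0.0, 0.0, 0.0, 0.0]], [0])
rows = calibration_bins([0.05, 0.05, 0.95, 0.95], [False, False, True, True])
```

A well-calibrated model has the first two entries of each row close to each other; the count is what a plot would use to size each point.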

Original abstract

Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers utilize Bayesian Improved Surname Geocoding (BISG), which has critically relied on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because the standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not covered in the Census. We compare five approaches: standard BISG using only surnames, BIFSG incorporating first name probabilities, surname embedding for unlisted names, surname and first name embedding combining both, and a full-name embedding trained on voter file data from Southern states that captures interactions between name components. We show that each successive eBISG approach improves race prediction, with the full-name embedding yielding the largest gains, particularly for Hispanic and Asian voters whose surnames are absent from the Census list.
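
For readers unfamiliar with the two baselines named in the abstract, the standard BISG update can be written as follows. The notation ($R$ for race, $S$ for surname, $G$ for geographic unit, $F$ for first name) is introduced here for illustration, and the formula is the commonly used form of BISG rather than a transcription from the paper.

$$
P(R = r \mid S = s, G = g) = \frac{P(R = r \mid S = s)\, P(G = g \mid R = r)}{\sum_{r'} P(R = r' \mid S = s)\, P(G = g \mid R = r')}
$$

BIFSG multiplies in a first-name term, assuming that first name, surname, and location are conditionally independent given race:

$$
P(R = r \mid S = s, F = f, G = g) \propto P(R = r \mid S = s)\, P(F = f \mid R = r)\, P(G = g \mid R = r)
$$

When $s$ is absent from the Census list, $P(R = r \mid S = s)$ is replaced by a generic prior that is the same for every unlisted surname, so the surname term carries no individual information. That is the gap eBISG targets.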

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

eBISG adds embedding-based handling for uncommon surnames with clear incremental gains on the tested data, but Southern-only training leaves generalization to national use unproven.

The main thing to know is that this paper gives a practical fix for the 10% of names missing from Census surname lists by training neural nets on name embeddings, and the full-name version trained on Southern voter files produces the biggest reported lift for Hispanic and Asian cases in their voter data tests.

What is actually new is the step-wise application of pre-trained embeddings plus neural networks to generate race probabilities for omitted surnames, extending beyond the standard BISG and BIFSG priors. They lay out five variants and show each addition improves performance on the held-out voter file splits they use. That empirical comparison is straightforward and gives credit to the idea that name interactions matter.

The soft spot is the training distribution. Everything for the embedding models comes from Southern-state voter files, and the abstract gives no indication of held-out checks on non-Southern or nationally representative samples. Name-race correlations differ by region, so the gains could shrink or shift when applied elsewhere. Details on validation splits, error bars, overfitting checks, and potential new biases are also thin in the summary, which makes it harder to judge how reliable the numbers are. The full manuscript may fill some gaps, but the regional training issue is real and not minor.

This is for social scientists and policy analysts who already run BISG on large administrative files and need better coverage for uncommon names. A reader in that space would get usable ideas from the method if they re-validate it locally. It deserves peer review because the core limitation it targets is genuine and the empirical setup is reproducible enough to let referees test the claims directly.

Referee Report

2 major / 0 minor

Summary. The paper claims that standard BISG degrades for the ~10% of the US population with uncommon surnames omitted from Census lists because it falls back to an uninformative prior; it proposes eBISG, which represents names via pre-trained embeddings and trains neural networks on 2020 Census surname/first-name data plus Southern-state voter files to produce race probabilities for unlisted names. Five successive variants are compared (surname-only BISG, BIFSG with first names, surname embedding, surname+first-name embedding, and full-name embedding), with each step reported to improve accuracy and the full-name model yielding the largest gains, especially for Hispanic and Asian voters.

Significance. If the reported gains are robust, the work directly addresses a well-known coverage gap in BISG that affects a non-trivial fraction of the population, offering a practical route to higher-fidelity individual race probabilities for disparity studies. The incremental empirical comparison of embedding strategies is a clear strength and could be cited in future applied work on name-based inference.

major comments (2)

[Abstract] The claim of 'consistent improvements across five approaches' is presented without any mention of validation splits, error bars, baseline comparisons, or overfitting checks, leaving the magnitude and reliability of the gains impossible to assess from the provided information.
[Methods (full-name embedding training)] The model is trained exclusively on Southern-state voter files; no held-out evaluation on non-Southern or nationally representative samples is described, so it remains unclear whether the reported gains for uncommon surnames generalize beyond the training distribution or simply reflect regional name-race correlations.
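
The generalization concern in the second comment could be probed with a leave-one-state-out split, sketched below. The record layout, the `state` field name, and the function name are hypothetical, and nothing in this summary indicates that the authors ran such an evaluation.

```python
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, str]


def leave_one_region_out(
    records: List[Record], region_key: str = "state"
) -> Iterator[Tuple[str, List[Record], List[Record]]]:
    """Yield (held_out_region, train_records, test_records) for each region,
    so a model is always evaluated on a region it never saw in training."""
    regions = sorted({r[region_key] for r in records})
    for held_out in regions:
        train = [r for r in records if r[region_key] != held_out]
        test = [r for r in records if r[region_key] == held_out]
        yield held_out, train, test


# Placeholder records; "A", "B", "C" stand in for state codes.
records = [
    {"name": "name one", "state": "A"},
    {"name": "name two", "state": "A"},
    {"name": "name three", "state": "B"},
    {"name": "name four", "state": "C"},
]
folds = list(leave_one_region_out(records))
```

If accuracy on each held-out state were close to the in-sample figure, that would be some evidence that the model is not merely memorising regional name-race correlations, although it would still say little about regions far from every training state.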

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review of our manuscript. We appreciate the referee's focus on the clarity of our claims and the generalizability of our methods. Below we respond to each major comment, indicating planned revisions to strengthen the paper.

Referee: [Abstract] The claim of 'consistent improvements across five approaches' is presented without any mention of validation splits, error bars, baseline comparisons, or overfitting checks, leaving the magnitude and reliability of the gains impossible to assess from the provided information.

Authors: We agree that the abstract, due to length constraints, does not detail the evaluation procedures. The full manuscript describes the validation splits from Census and voter file data, includes baseline comparisons for all five approaches, reports performance metrics, and incorporates robustness checks. We will revise the abstract to briefly reference the held-out validation and statistical assessment of gains. revision: yes

Referee: [Methods (full-name embedding training)] The model is trained exclusively on Southern-state voter files; no held-out evaluation on non-Southern or nationally representative samples is described, so it remains unclear whether the reported gains for uncommon surnames generalize beyond the training distribution or simply reflect regional name-race correlations.

Authors: The full-name embedding uses Southern-state voter files because they supply large-scale self-reported race linked to full names, enabling capture of component interactions absent from Census lists. We acknowledge this may embed regional name-race patterns. In revision we will expand the methods to justify the data choice and add a limitations section discussing possible regional biases along with a call for future national-scale validation; we cannot add new held-out experiments without further data. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical gains validated on held-out data

The paper's core contribution is an empirical comparison of five approaches (standard BISG, BIFSG, and three embedding-based eBISG extensions). The embedding models are trained on Census surname and first-name data plus Southern voter files, and performance is measured via predictive accuracy on held-out observations. No derivation reduces a claimed result to its own fitted parameters by construction, no uniqueness theorem is invoked via self-citation, and no ansatz is smuggled through prior work. The reported improvements for uncommon surnames are presented as data-driven outcomes rather than definitional identities, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that name embeddings trained on Census and voter data capture stable race associations for unseen names, plus many fitted neural-network parameters.

free parameters (1)

Neural network architecture and training hyperparameters
Chosen to map name embeddings to race probability distributions on Census and voter data.

axioms (1)

Domain assumption: pre-trained text embeddings preserve race-relevant information from surname and first-name data
Invoked to represent uncommon names as vectors for probability estimation.

invented entities (1)

eBISG full-name embedding model (no independent evidence)
purpose: Captures interactions between name components to predict race for names absent from Census lists
New component introduced to address the generic-prior limitation of standard BISG.
