arxiv: 2604.22555 · v2 · submitted 2026-04-24 · 💻 cs.CL

Using Embedding Models to Improve Probabilistic Race Prediction

Pith reviewed 2026-05-08 11:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords race prediction · BISG · name embeddings · probabilistic imputation · voter files · racial disparities · neural networks · uncommon surnames

The pith

Embedding models improve race probability estimates for people with uncommon surnames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard BISG race prediction relies on Census surname lists that cover only common names, and it falls back to an uninformative generic prior for the roughly 10 percent of the population whose surnames are omitted. The paper introduces eBISG, which converts names into dense vectors using pre-trained text embeddings and trains neural networks on 2020 Census name data plus Southern-state voter files to produce race probabilities for those omitted names. Successive versions that add first-name embeddings and then full-name embeddings each raise predictive accuracy, with the largest gains for Hispanic and Asian individuals whose surnames are absent from the Census list. Better estimates matter because many studies of racial disparities in voting, health, or lending depend on imputing race at the individual level from available records.
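
The fallback behavior described above can be sketched in a few lines. This is a minimal illustration of the textbook BISG update, P(race | surname, location) ∝ P(race | surname) × P(location | race), and not the authors' implementation. The function name `bisg_posterior`, the surname table, the generic prior, and every number below are made up for illustration.

```python
from typing import Dict, Optional

RACES = ("white", "black", "hispanic", "asian", "other")


def bisg_posterior(
    surname_probs: Optional[Dict[str, float]],
    geo_given_race: Dict[str, float],
    fallback_prior: Dict[str, float],
) -> Dict[str, float]:
    """Combine P(race | surname) with P(location | race) via Bayes' rule.

    If the surname is not in the table (surname_probs is None), the
    generic fallback prior is used in place of P(race | surname).
    """
    prior = surname_probs if surname_probs is not None else fallback_prior
    unnormalized = {r: prior[r] * geo_given_race[r] for r in RACES}
    total = sum(unnormalized.values())
    return {r: value / total for r, value in unnormalized.items()}


# Hypothetical inputs: none of these numbers come from the Census or the paper.
SURNAME_TABLE = {
    "GARCIA": {"white": 0.05, "black": 0.01, "hispanic": 0.92,
               "asian": 0.01, "other": 0.01},
}
FALLBACK = {"white": 0.60, "black": 0.12, "hispanic": 0.18,
            "asian": 0.06, "other": 0.04}
GEO_GIVEN_RACE = {"white": 0.0002, "black": 0.0001, "hispanic": 0.0004,
                  "asian": 0.0001, "other": 0.0002}

# A listed surname uses its own row; an unlisted one gets the generic prior.
listed = bisg_posterior(SURNAME_TABLE.get("GARCIA"), GEO_GIVEN_RACE, FALLBACK)
unlisted = bisg_posterior(SURNAME_TABLE.get("XYZQUEZ"), GEO_GIVEN_RACE, FALLBACK)
print(round(listed["hispanic"], 3), round(unlisted["hispanic"], 3))
```

With these toy inputs, the unlisted surname receives the same posterior as every other unlisted surname in that location, which is the loss of information that eBISG is meant to address.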

Core claim

The authors show that eBISG approaches outperform standard BISG for uncommon surnames, and that the full-name embedding model trained on voter file data yields the largest gains in race prediction accuracy by capturing interactions between name components that the surname-only and surname-plus-first-name methods miss.

What carries the argument

The full-name embedding model, which turns complete names into dense vectors from pre-trained embeddings and trains a neural network on Southern voter files to output race probabilities.
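
The shape of such a pipeline (name to vector, vector to race probabilities) can be sketched as below. Everything here is an assumption made for illustration: a hashed character-trigram vector stands in for the pre-trained text embedding, a single softmax layer stands in for the paper's neural network, and the names, labels, and three-class label set are toy data. The source does not specify the embedding model or the network architecture.

```python
import zlib

import numpy as np

DIM = 64                                   # size of the stand-in embedding
CLASSES = ["asian", "hispanic", "white"]   # toy label subset, not the paper's


def embed(full_name: str) -> np.ndarray:
    """Stand-in embedding: L2-normalized hashed character-trigram counts."""
    text = f"  {full_name.lower()}  "
    vec = np.zeros(DIM)
    for i in range(len(text) - 2):
        trigram = text[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % DIM] += 1.0
    return vec / np.linalg.norm(vec)


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train(X, y, n_classes, lr=0.5, steps=500):
    """Fit a single softmax layer by full-batch gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]               # one-hot targets
    for _ in range(steps):
        P = softmax(X @ W + b)
        grad = (P - Y) / len(X)            # gradient of mean cross-entropy
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict_proba(name: str, W, b) -> np.ndarray:
    return softmax(embed(name)[None, :] @ W + b)[0]


# Toy training set: made-up names with made-up labels.
TOY = [("maria garcia", "hispanic"), ("jose hernandez", "hispanic"),
       ("wei zhang", "asian"), ("li nguyen", "asian"),
       ("john smith", "white"), ("mary johnson", "white")]
X = np.stack([embed(name) for name, _ in TOY])
y = np.array([CLASSES.index(label) for _, label in TOY])

W, b = train(X, y, len(CLASSES))
probs = predict_proba("ana garcia lopez", W, b)   # a name not in the toy set
print(dict(zip(CLASSES, probs.round(3))))
```

Embedding the whole name string at once, rather than the surname and first name separately, is what lets a model of this kind pick up interactions between name components.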

If this is right

  • Standard BISG performance drops for the roughly 10 percent of the population with uncommon surnames because it falls back to a generic prior.
  • Adding surname embeddings alone improves predictions for omitted names over the baseline.
  • Combining surname and first-name embeddings produces further gains by using more name information.
  • The full-name embedding captures name-component interactions and delivers the biggest accuracy lift, especially for Hispanic and Asian groups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding technique could be adapted to impute other protected attributes from names or addresses in administrative datasets.
  • Regional differences in name-race patterns might cause the Southern-trained models to underperform in states with different demographics.
  • Adopting eBISG in disparity studies would allow researchers to retain more observations instead of dropping records with missing surnames.

Load-bearing premise

Neural networks trained on Southern-state voter files and Census data will generalize to produce accurate race probabilities for uncommon names across the full US population without introducing new biases.

What would settle it

A direct comparison of eBISG and standard BISG on a large national sample of individuals with uncommon surnames and verified race labels would settle it: the central claim fails if the embedding models do not show higher precision or recall than standard BISG on that sample.
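
A comparison of this kind reduces to computing per-group precision and recall for each method on the same labeled sample. The sketch below assumes each method's probabilities have already been converted to a single predicted label per person; the labels and predictions are invented for illustration and are not results from the paper.

```python
from typing import Sequence, Tuple


def precision_recall(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Tuple[float, float]:
    """One-vs-rest precision and recall for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels and predictions for six people with unlisted surnames.
y_true = ["hispanic", "hispanic", "asian", "white", "hispanic", "asian"]
bisg_pred = ["white", "hispanic", "white", "white", "white", "white"]
ebisg_pred = ["hispanic", "hispanic", "asian", "white", "white", "asian"]

for method, preds in (("BISG", bisg_pred), ("eBISG", ebisg_pred)):
    p, r = precision_recall(y_true, preds, positive="hispanic")
    print(f"{method}: precision={p:.2f} recall={r:.2f}")
```

The paper's figures report precision-recall curves, which repeat this calculation across probability thresholds instead of using one predicted label per person.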

Figures

Figures reproduced from arXiv: 2604.22555 by Kosuke Imai, Noah Dasanaike.

Figure 1: Precision-recall curves comparing voters whose surnames appear in the Census ...
Figure 2: Precision-recall curves for BISG and eBISG race prediction among voters with ...
Figure 3: Brier score for individual-level race prediction among voters with unmatched ...
Figure 4: Calibration of BISG race probabilities for all voters. Points are sized by the number ...

Original abstract

Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers utilize Bayesian Improved Surname Geocoding (BISG), which has critically relied on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because the standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not covered in the Census. We compare five approaches: standard BISG using only surnames, BIFSG incorporating first name probabilities, surname embedding for unlisted names, surname and first name embedding combining both, and a full-name embedding trained on voter file data from Southern states that captures interactions between name components. We show that each successive eBISG approach improves race prediction, with the full-name embedding yielding the largest gains, particularly for Hispanic and Asian voters whose surnames are absent from the Census list.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that standard BISG degrades for the ~10% of the US population with uncommon surnames omitted from Census lists because it falls back to an uninformative prior; it proposes eBISG, which represents names via pre-trained embeddings and trains neural networks on 2020 Census surname/first-name data plus Southern-state voter files to produce race probabilities for unlisted names. Five successive variants are compared (surname-only BISG, BIFSG with first names, surname embedding, surname+first-name embedding, and full-name embedding), with each step reported to improve accuracy and the full-name model yielding the largest gains, especially for Hispanic and Asian voters.

Significance. If the reported gains are robust, the work directly addresses a well-known coverage gap in BISG that affects a non-trivial fraction of the population, offering a practical route to higher-fidelity individual race probabilities for disparity studies. The incremental empirical comparison of embedding strategies is a clear strength and could be cited in future applied work on name-based inference.

major comments (2)
  1. [Abstract] The claim of 'consistent improvements across five approaches' is presented without any mention of validation splits, error bars, baseline comparisons, or overfitting checks, leaving the magnitude and reliability of the gains impossible to assess from the provided information.
  2. [Methods, full-name embedding training] The model is trained exclusively on Southern-state voter files; no held-out evaluation on non-Southern or nationally representative samples is described, so it remains unclear whether the reported gains for uncommon surnames generalize beyond the training distribution or simply reflect regional name-race correlations.
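
The out-of-region check requested in the second comment amounts to holding out every record from states that were not used for training. A minimal sketch, assuming voter records are dictionaries with a `state` field; the record layout, the placeholder names, and the state codes are hypothetical, and the set of training states is not the list used in the paper, which the source does not give.

```python
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]


def split_by_region(
    records: List[Record], train_states: Set[str]
) -> Tuple[List[Record], List[Record]]:
    """Put records from train_states in the training set; hold out the rest."""
    train = [r for r in records if r["state"] in train_states]
    held_out = [r for r in records if r["state"] not in train_states]
    return train, held_out


# Hypothetical placeholder set, NOT the states used by the authors.
TRAIN_STATES = {"FL", "GA"}

# Invented records; names and labels are placeholders.
records = [
    {"name": "name a", "state": "FL", "race": "hispanic"},
    {"name": "name b", "state": "GA", "race": "black"},
    {"name": "name c", "state": "WA", "race": "asian"},
    {"name": "name d", "state": "NY", "race": "white"},
]

train, held_out = split_by_region(records, TRAIN_STATES)
print(len(train), "training records;", len(held_out), "held-out records")
```

Splitting on the state, and not on a random sample of rows, keeps any region-specific name-race pattern out of the evaluation set, which is the property the referee is asking for.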

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review of our manuscript. We appreciate the referee's focus on the clarity of our claims and the generalizability of our methods. Below we respond to each major comment, indicating planned revisions to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] The claim of 'consistent improvements across five approaches' is presented without any mention of validation splits, error bars, baseline comparisons, or overfitting checks, leaving the magnitude and reliability of the gains impossible to assess from the provided information.

    Authors: We agree that the abstract, due to length constraints, does not detail the evaluation procedures. The full manuscript describes the validation splits from Census and voter file data, includes baseline comparisons for all five approaches, reports performance metrics, and incorporates robustness checks. We will revise the abstract to briefly reference the held-out validation and statistical assessment of gains. revision: yes

  2. Referee: [Methods, full-name embedding training] The model is trained exclusively on Southern-state voter files; no held-out evaluation on non-Southern or nationally representative samples is described, so it remains unclear whether the reported gains for uncommon surnames generalize beyond the training distribution or simply reflect regional name-race correlations.

    Authors: The full-name embedding uses Southern-state voter files because they supply large-scale self-reported race linked to full names, enabling capture of component interactions absent from Census lists. We acknowledge this may embed regional name-race patterns. In revision we will expand the methods to justify the data choice and add a limitations section discussing possible regional biases along with a call for future national-scale validation; we cannot add new held-out experiments without further data. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical gains validated on held-out data

The paper's core contribution is an empirical comparison of five approaches (standard BISG, BIFSG, and three embedding-based eBISG variants) built on Census surname and first-name data plus Southern voter files, with performance measured via predictive accuracy on held-out observations. No derivation reduces a claimed result to its own fitted parameters by construction, no uniqueness theorem is invoked via self-citation, and no ansatz is smuggled through prior work. The reported improvements for uncommon surnames are presented as data-driven outcomes rather than definitional identities, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that name embeddings trained on Census and voter data capture stable race associations for unseen names, plus many fitted neural-network parameters.

free parameters (1)
  • Neural network architecture and training hyperparameters
    Chosen to map name embeddings to race probability distributions on Census and voter data.
axioms (1)
  • domain assumption: Pre-trained text embeddings preserve race-relevant information from surname and first-name data
    Invoked to represent uncommon names as vectors for probability estimation.
invented entities (1)
  • eBISG full-name embedding model (no independent evidence)
    purpose: Captures interactions between name components to predict race for names absent from Census lists
    New component introduced to address the generic-prior limitation of standard BISG.

pith-pipeline@v0.9.0 · 5519 in / 1196 out tokens · 70285 ms · 2026-05-08T11:35:22.379048+00:00 · methodology
