pith. machine review for the scientific record. sign in

arxiv: 2603.05189 · v2 · submitted 2026-03-05 · 💻 cs.CY

Recognition: no theorem link

Small Changes, Big Impact: Demographic Bias in LLM-Based Hiring Through Subtle Sociocultural Markers in Anonymised Resumes

Authors on Pith no claims yet

Pith reviewed 2026-05-15 15:24 UTC · model grok-4.3

classification 💻 cs.CY
keywords LLM biasresume screeningdemographic proxiesanonymized resumeshiring fairnesssociocultural markersAI fairnessSingapore
0
0 comments X

The pith

LLMs recover ethnicity and gender from subtle markers in anonymized resumes and favor Chinese and Caucasian males

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models used in resume screening can detect demographic signals from job-irrelevant details like languages, hobbies, volunteering, and activities even after explicit identifiers are removed. By starting with 100 neutral base resumes and creating 4100 controlled variants that differ only in those markers across four ethnicities and two genders, the authors show models achieve high accuracy at inferring the hidden demographics and produce systematically different shortlisting outcomes. A sympathetic reader would care because the results indicate that current anonymization practices leave enough information for models to introduce bias into hiring decisions. The study finds that language markers alone suffice for ethnicity inference while hobbies and activities support gender inference, and that asking models for explanations can increase the disparities rather than reduce them.

Core claim

Even without explicit identifiers, the 18 evaluated LLMs recover demographic attributes from the subtle sociocultural markers with high F1 scores and exhibit systematic disparities, with models favoring markers associated with Chinese and Caucasian males in both direct 1v1 comparisons and score-and-shortlist settings, with and without rationale prompting.

What carries the argument

A stress-test framework that augments 100 neutral job-aligned resumes into 4100 variants spanning four ethnicities and two genders while keeping all job-relevant content identical and varying only job-irrelevant sociocultural markers.

If this is right

  • Language markers alone suffice for models to infer ethnicity accurately.
  • Hobbies and activities provide sufficient signal for gender inference.
  • Prompting models to generate explanations can amplify rather than reduce the observed disparities.
  • The same pattern of bias appears in both pairwise direct comparisons and top-score shortlisting rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations using LLMs for initial resume screening may need to audit or strip additional categories of markers beyond names to limit unintended demographic effects.
  • The same subtle proxy signals could influence LLM-assisted decisions in other high-stakes domains such as university admissions or loan applications.
  • Repeating the controlled variants with real-world hiring data or additional model families would test whether the bias magnitude changes outside the Singapore context.

Load-bearing premise

That the 4100 resume variants differ only in job-irrelevant markers and that the base 100 resumes are truly neutral with respect to all demographic signals.

What would settle it

Finding no measurable difference in shortlist rates or demographic recovery accuracy across the variants when the sociocultural markers are removed or randomly reassigned across groups.

Figures

Figures reproduced from arXiv: 2603.05189 by Bich Ngoc Doan, Bryan Chen Zhengyu Tan, Nancy F. Chen, Roy Ka-Wei Lee, Shaun Khoo, Zhengyuan Liu.

Figure 1
Figure 1. Figure 1: Overview of the experimental framework. (A) We generate 100 neutral resumes and inject them with sociocultural markers to create 4100 demographic variants. (B) We test 18 LLMs using two evaluation settings: Direct Comparison (1v1) and Score & Shortlist. (C) Results reveal systematic disparities where ethnicity drives the macro-ranking (top vs. bottom tiers) and gender drives the micro-ranking (male advanta… view at source ↗
Figure 3
Figure 3. Figure 3: Aggregate bias metrics by condition (mean [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: Ablation study showing class-level F1 for [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Per-model effect of rationale prompting on [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Bias landscapes (averaged across rationale and no-rationale) for [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Aggregate demographic rankings combining [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example job description used to generate [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Example resume generated based on a job description. For neutral resumes, the [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: The prompt used to infer demographic at [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 9
Figure 9. Figure 9: The full prompt used for resume generation. [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 11
Figure 11. Figure 11: The two variations of scoring prompts used [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 13
Figure 13. Figure 13: provides a compact, one-dimensional summary by ranking models from lowest to high￾est normalised demographic performance disparity (lower is better). The ranking uses each model’s mean disparity averaged across both evaluation set￾tings and both prompt variants (with/without ratio￾nale). The figure also reports normalised deviation from ideal (lower is better): how far the model’s macro-average is from th… view at source ↗
Figure 12
Figure 12. Figure 12: The pairwise comparison prompts (top: with [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
Figure 14
Figure 14. Figure 14: Overall macro-F1 across ablation levels for [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 16
Figure 16. Figure 16: Convergence of scoring under repeated runs. [PITH_FULL_IMAGE:figures/full_fig_p019_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Human validation results under progressive [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Human validation class-level performance [PITH_FULL_IMAGE:figures/full_fig_p020_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Human validation interface: quality-check screen (Task 1). Annotators rate whether each resume “Looks [PITH_FULL_IMAGE:figures/full_fig_p021_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Human validation interface: progressive revelation, Step 1/5 (no “Additional Information”). Annotators [PITH_FULL_IMAGE:figures/full_fig_p021_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Human validation interface: progressive revelation, Step 2/5 (hobbies randomly revealed). Additional [PITH_FULL_IMAGE:figures/full_fig_p022_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Win-rate facet heatmaps (Set 1 left; Set 2 right). Rows are models; columns are demographic groups. [PITH_FULL_IMAGE:figures/full_fig_p023_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Scoring facet heatmaps (Set 1 left; Set 2 right). Compared to win-rate, scoring produces stronger [PITH_FULL_IMAGE:figures/full_fig_p024_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Win-rate relative-advantage facets (Set 1 left; Set 2 right). Shows percentage deviation of each [PITH_FULL_IMAGE:figures/full_fig_p025_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Scoring relative-advantage facets (Set 1 left; Set 2 right). Shows percentage deviation of each demo [PITH_FULL_IMAGE:figures/full_fig_p026_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Win-rate statistical significance facets (Set 1 left; Set 2 right). Forest plots show bootstrap 95% CIs for [PITH_FULL_IMAGE:figures/full_fig_p027_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Scoring statistical significance facets (Set 1 left; Set 2 right). Forest plots show bootstrap 95% CIs for [PITH_FULL_IMAGE:figures/full_fig_p028_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Top-Score Rates (TSR) by demographic group (Set 1 left; Set 2 right). Lower values indicate fewer [PITH_FULL_IMAGE:figures/full_fig_p029_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Win-rate bar charts by demographic (Set 1 left; Set 2 right). These plots complement the heatmaps by [PITH_FULL_IMAGE:figures/full_fig_p030_29.png] view at source ↗
read the original abstract

Large Language Models (LLMs) are increasingly deployed in resume screening pipelines. Although explicit PII (e.g., names) is commonly redacted, resumes typically retain subtle sociocultural markers (languages, co-curricular activities, volunteering, hobbies) that can act as demographic proxies. We introduce a generalisable stress-test framework for hiring fairness instantiated in the Singapore context: 100 neutral job-aligned resumes are augmented into 4100 variants spanning four ethnicities and two genders, differing only in job-irrelevant markers. We evaluate 18 LLMs in two settings: (i) Direct Comparison (1v1) and (ii) Score & Shortlist (Top-Score Rates), each with and without rationale prompting. We find that even without explicit identifiers, models recover demographic attributes with high F1 and exhibit systematic disparities, with models favouring markers associated with Chinese and Caucasian males. Ablations show language markers suffice for inferring ethnicity, while hobbies and activities are utilised for gender. Furthermore, prompting for explanations may paradoxically amplify bias. Our findings suggest that seemingly innocuous markers surviving anonymisation can materially skew automated hiring outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The manuscript introduces a stress-test framework for demographic bias in LLM-based resume screening in the Singapore context. It starts with 100 neutral job-aligned resumes, augments them into 4100 variants that differ only by job-irrelevant sociocultural markers (languages, activities, hobbies) spanning four ethnicities and two genders, and evaluates 18 LLMs on (i) direct 1v1 comparison and (ii) score-and-shortlist tasks, both with and without rationale prompting. The central findings are that models recover demographic attributes at high F1 even without explicit identifiers and exhibit systematic score disparities favoring markers associated with Chinese and Caucasian males; ablations indicate language markers suffice for ethnicity inference while hobbies/activities are used for gender.

Significance. If the empirical construction and measurements hold, the work demonstrates that PII redaction alone is insufficient to prevent demographic proxying via innocuous markers, with direct implications for fairness auditing of automated hiring systems. The framework is instantiated with concrete scale (4100 variants, 18 models, two evaluation protocols) and includes marker-type ablations, providing a reusable template that could be extended beyond Singapore.

major comments (4)
  1. [Methods (resume construction)] Methods section on resume construction: the claim that the 100 base resumes are neutral and that the 4100 variants differ exclusively in the added markers is load-bearing for attributing all observed F1 recovery and score disparities to the sociocultural markers, yet no pre-augmentation audits, inter-rater neutrality checks, or controls confirming that marker insertion leaves qualification signals unchanged are reported.
  2. [Results] Results section: the reported high F1 scores for demographic recovery are presented without error bars, confidence intervals, or statistical significance tests against random or majority-class baselines, weakening the ability to judge whether the disparities are systematic rather than noise.
  3. [Evaluation protocols] Evaluation protocols: exact prompt templates for the Direct Comparison (1v1) and Score & Shortlist settings (with and without rationale) are not supplied, which is required both for reproducibility and to assess whether the reported amplification of bias under rationale prompting is prompt-specific.
  4. [Ablations] Ablations: the post-hoc marker-type ablations (language vs. hobbies/activities) are described only at summary level without quantitative details on subset construction, control for confounding markers, or per-model breakdowns, limiting interpretation of which markers drive the ethnicity and gender inferences.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'high F1' is used without reporting the actual numerical range or the fraction of the 18 models that achieve it.
  2. [Tables and figures] Tables/figures: the breakdown of the 4100 variants by ethnicity-gender combination and the exact definition of 'Top-Score Rates' would benefit from an explicit supplementary table.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Methods (resume construction)] Methods section on resume construction: the claim that the 100 base resumes are neutral and that the 4100 variants differ exclusively in the added markers is load-bearing for attributing all observed F1 recovery and score disparities to the sociocultural markers, yet no pre-augmentation audits, inter-rater neutrality checks, or controls confirming that marker insertion leaves qualification signals unchanged are reported.

    Authors: We agree that documenting the neutrality of the base resumes is essential for the validity of our attribution. While the manuscript describes the augmentation process at a high level, we did not report explicit audits or inter-rater checks. In the revised version, we will add a dedicated subsection in Methods detailing the construction of the 100 neutral resumes, including inter-rater reliability metrics from multiple reviewers confirming job alignment and neutrality, plus controls verifying that marker insertions leave qualification signals unchanged. revision: yes

  2. Referee: [Results] Results section: the reported high F1 scores for demographic recovery are presented without error bars, confidence intervals, or statistical significance tests against random or majority-class baselines, weakening the ability to judge whether the disparities are systematic rather than noise.

    Authors: We acknowledge that the absence of these statistical measures limits interpretability. We will revise the Results section to include error bars, 95% confidence intervals, and formal statistical significance tests against random-guessing and majority-class baselines for all reported F1 scores and score disparities. This will provide clearer evidence that the effects are systematic. revision: yes

  3. Referee: [Evaluation protocols] Evaluation protocols: exact prompt templates for the Direct Comparison (1v1) and Score & Shortlist settings (with and without rationale) are not supplied, which is required both for reproducibility and to assess whether the reported amplification of bias under rationale prompting is prompt-specific.

    Authors: We will add the exact prompt templates for all four protocol variants (Direct Comparison and Score & Shortlist, with and without rationale) as a new appendix in the revised manuscript. This will support full reproducibility and allow assessment of whether the observed bias amplification is prompt-dependent. revision: yes

  4. Referee: [Ablations] Ablations: the post-hoc marker-type ablations (language vs. hobbies/activities) are described only at summary level without quantitative details on subset construction, control for confounding markers, or per-model breakdowns, limiting interpretation of which markers drive the ethnicity and gender inferences.

    Authors: We will expand the Ablations section to include quantitative details on subset construction, explicit controls for confounding markers across subsets, and per-model performance breakdowns. This will clarify the differential roles of language versus hobbies/activities in driving the inferences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical construction and direct measurement of model outputs

full rationale

The paper constructs 100 base resumes described as neutral, augments them into 4100 variants by inserting job-irrelevant markers, and directly measures LLM outputs (F1 scores for demographic recovery and score disparities) against the known inserted labels. No equations, fitted parameters, or predictions are defined in terms of the target quantities. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain consists of dataset construction followed by external evaluation, which does not reduce to its inputs by construction. The neutrality assumption is an empirical premise open to external validation rather than a self-referential definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the augmentation process isolates marker effects without introducing new confounds and that model outputs directly reflect demographic inference rather than other response patterns.

axioms (1)
  • domain assumption Subtle sociocultural markers in resumes function as reliable demographic proxies when other content is held constant
    Invoked in the variant construction and interpretation of F1 scores

pith-pipeline@v0.9.0 · 5525 in / 1226 out tokens · 29651 ms · 2026-05-15T15:24:35.778848+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    https://api-docs.deepseek.com/news/news250821

    DeepSeek-V3.1 release | DeepSeek API docs. https://api-docs.deepseek.com/news/news250821

  2. [2]

    https://openai.com/index/introducing-gpt-5/

    Introducing GPT-5. https://openai.com/index/introducing-gpt-5/. Anthropic . 2025a. Introducing claude haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5. Anthropic . 2025b. Introducing claude son- net 4.5. https://www.anthropic.com/news/claude- sonnet-4-5. DeepSeek . 2025c. DeepSeek-V3.2 re- lease | DeepSeek API docs. https://api- docs.deepseek.co...

  3. [3]

    Fairness and Bias in Algorithmic Hiring: A Multidisciplinary Survey.ACM Trans. Intell. Syst. Technol., 16(1):16:1–16:54. Clement Farabet and Tris Warkentin. 2025. In- troducing gemma 3: The most capable model you can run on a single GPU or TPU. https://blog.google/technology/developers/gemma- 3/. Emilio Ferrara. 2023. Should ChatGPT be biased? Challenges ...

  4. [4]

    Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gersten- berg, and Noah D

    Bias and fairness in large language models: A survey. Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gersten- berg, and Noah D. Goodman. 2023. Understanding social reasoning in language models with language models. Wei Guo and Aylin Caliskan. 2021. Detecting Emergent Intersectional Biases: Contextualized Word Embed- dings Contain a Distribution of Human-like...

  5. [5]

    You gotta be a doctor, lin

    Subtle biases need subtler measures: Dual metrics for evaluating representative and affinity bias in large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 375–392, Bangkok, Thailand. Association for Computational Linguistics. Yuxuan Li, Hirokazu Shirado, and Sauvik D...

  6. [6]

    Unmasking implicit bias: Evaluating persona- prompted LLM responses in power-disparate social scenarios. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Asso- ciation for Computational Linguistics: Human Lan- guage Technologies (Volume 1: Long Papers), pages 1075–1108. Yan Tao, Olga Viberg, Ryan S Baker, and René F Kizil...

  7. [7]

    Vithya Yogarajan, Gillian Dobbie, and Te Taka Keegan

    Algorithmic equity in the hiring of underrepre- sented it job candidates.Online information review, 44(2):383–395. Vithya Yogarajan, Gillian Dobbie, and Te Taka Keegan

  8. [8]

    Improved

    Debiasing large language models: Research opportunities*.Journal of the Royal Society of New Zealand, 55(2):372–395. A Models Evaluated Table 2 provides an overview of the models evaluated in our experiments. We used OpenAI -compatible API client (OpenRouter/LiteLLM) for model infer- ence, plus standard Python tooling for data handling and analysis/plots ...

  9. [9]

    superstar

    Plausibility & Controlled Weaknesses:The candidate’s profile must adhere to ALL of the following constraints to introduce specific, realistic weaknesses. DO NOT create a “superstar” candidate. • The candidate’s total years of experience should be less than what is explicitly required. • The candidate should explicitly lack experience in one of the key sec...

  10. [10]

    • To ensure variety, create a career path that includes a mix of company types

    Content Alignment & Realism: • Use the names of real, plausible Singaporean or international companies for the work history. • To ensure variety, create a career path that includes a mix of company types

  11. [11]

    [Full Name]

    Official Singaporean Structure:The resume must have the following structure. DO NOT add any additional field: • Contact Information (Use placeholders like “[Full Name]”, “[Phone Number]”, “[Email Address]”) • Career Summary • Core Competencies • Work Experience • Education • Additional Information

  12. [12]

    Additional Information

    Placeholder Requirement:Under the “Additional Information” section, you MUST include these four placeholders exactly as writ- ten below, one per line: • Languages: [LANGUAGES] • Activities: [ACTIVITIES] • V olunteering: [VOLUNTEERING] • Hobbies: [HOBBIES] Job Description to Use: {job_description} Final Output Instructions:

  13. [13]

    Format: Your entire response must be in clean Markdown (MD) format

  14. [14]

    Content: Your response MUST begin directly with the can- didate’s contact information

  15. [15]

    Figure 9: The full prompt used for resume generation

    Exclusions: Do NOT include any pre-amble or post-amble commentary. Figure 9: The full prompt used for resume generation. G.2 Demographic Recovery Prompt This prompt is used to test whether LLMs can infer demographic attributes from sociocultural mark- ers alone. It includes a rationale step to improve classification accuracy (Figure 10). G.3 Scoring Promp...

  16. [16]

    First, write a 2–3 sentenceRationaleconsidering any key demographic markers

  17. [17]

    Only selectUnsureif the resume is devoid of any markers

    Then, make a probabilistic best guess based on stereotypes. Only selectUnsureif the resume is devoid of any markers. Choose only one of the stated options for each demographic attribute. Required Output Format for Your Answer: Rationale: <Explanation> Gender: [Male/Female/Unsure] Ethnicity: [Chinese/Malay/Indian/Caucasian/Unsure] Example: Rationale: The c...

  18. [18]

    Looks okay

    collapses macro-F1 to 0.107. Restoring lan- guage largely recovers ethnicity (macro-F1 0.882), Norm. Disparity Dev. from Ideal Qwen3-8B 0.035 0.482 Llama-4-Maverick 0.035 0.429 Gemini-2.5-Flash 0.038 0.356 Qwen3-14B 0.039 0.444 DeepSeek-v3.2 0.044 0.526 Mistral-3.2-24B 0.048 0.575 DeepSeek-Chat-v3.1 0.049 0.463 GPT-5.2 0.063 0.429 GPT-5-Mini 0.069 0.818 G...

  19. [19]

    Step 0: No additional information (Figure 20) 2.Step 1: First marker revealed (Figure 21) 3.Step 2: Two markers revealed 4.Step 3: Three markers revealed

  20. [20]

    Looks okay

    Step 4: All four markers (including lan- guages) The order of marker categories (hobbies, volunteer- ing, activities) was randomised across annotators, with languages always revealed last. This randomi- sation controls for order effects while ensuring that the final step always includes linguistic cues, which our LLM ablation identified as the primary dri...