arxiv: 2606.18649 · v1 · pith:5M7POJGGnew · submitted 2026-06-17 · 💻 cs.MA · cs.CL · cs.CY

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

Pith reviewed 2026-06-26 18:55 UTC · model grok-4.3

classification 💻 cs.MA · cs.CL · cs.CY
keywords gender bias · LLM hiring · Japanese resumes · pro-female bias · name reliance · mitigation strategies · random effects model · rirekisho

The pith

Five LLMs show consistent pro-female bias when rating Japanese-style resumes, with the bias driven almost entirely by candidate names.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether gender bias in LLM hiring decisions appears outside English and Western resume formats by running controlled evaluations on Japanese rirekisho documents. It finds that all five models rate candidates with female-signaling names higher than those with male-signaling names. Removing the name from the prompt nearly eliminates the difference, while adding an instruction to make decisions gender-neutral leaves the bias intact. The work also records that one model's safety filter rejects a large share of name-anonymized prompts. These results indicate that the name itself functions as the dominant gender cue for the models.

Core claim

A crossed random-effects linear mixed model applied to 43,200 evaluations of 60 rirekisho resumes across 12 linguistically grounded name pairs confirms a significant pro-female bias in every model tested. The name-reliance analysis shows that the female rating advantage drops by nearly its full magnitude once names are stripped from the prompt. A gender-neutrality instruction produces no meaningful reduction, while an incompatibility between privacy filters and content safety rules causes a 42 percent refusal rate for one model.

What carries the argument

A counterfactual resume design with 12 linguistically grounded name pairs, combined with a crossed random-effects linear mixed model that isolates the name as the primary gender channel.
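
The counterfactual design can be sketched in a few lines. This is a minimal illustration, not the authors' code: the template text, the `{name}` slot, the placeholder names, and the function `build_prompts` are all hypothetical, and full crossing of every resume with every name pair is an assumption. Only the core idea, that the two versions of a prompt differ in nothing but the name, comes from the paper.

```python
from itertools import product

def build_prompts(templates, name_pairs):
    """Cross every resume template with both names of every name pair.

    templates:  dict mapping resume id -> text with a "{name}" slot
                (hypothetical format, not the paper's actual template).
    name_pairs: list of (female_name, male_name) tuples.
    """
    rows = []
    for (resume_id, text), (pair_id, pair) in product(
        templates.items(), enumerate(name_pairs)
    ):
        for gender, name in zip(("female", "male"), pair):
            rows.append({
                "resume_id": resume_id,
                "pair_id": pair_id,
                "gender": gender,
                # Only the name field changes between the two versions.
                "prompt": text.format(name=name),
            })
    return rows

# Toy inputs: placeholder names, not the paper's 12 pairs.
templates = {f"r{i}": "Name: {name}\nRole: analyst" for i in range(60)}
name_pairs = [(f"F{k}", f"M{k}") for k in range(12)]
prompts = build_prompts(templates, name_pairs)
```

Under these assumptions, 60 templates and 12 pairs give 1,440 resume-name combinations. How those combinations map onto the paper's 43,200 API calls is not specified in this summary.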

If this is right

  • Pro-female bias in LLM resume screening replicates in a non-Western corporate context.
  • A direct gender-neutrality instruction in the prompt does not reduce the bias.
  • Removing candidate names from the input reduces the bias by nearly its full size.
  • Privacy-filter anonymization triggers high refusal rates in at least one current model.
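
The "nearly its full size" statement can be read as a simple ratio of effects. The sketch below is one natural way to express it, assuming the name-reliance share is defined as the fraction of the baseline effect that vanishes once the name is removed. The paper's formal definition is not reproduced in this summary, and the numbers are invented for illustration.

```python
def name_reliance(effect_with_name, effect_without_name):
    """Share of the baseline gender effect that disappears without the name."""
    if effect_with_name == 0:
        raise ValueError("baseline effect is zero; the share is undefined")
    return 1.0 - effect_without_name / effect_with_name

# Hypothetical effect sizes for illustration only, not results from the paper.
share = name_reliance(effect_with_name=0.40, effect_without_name=0.02)
```

A share close to 1 would mean the name carries almost all of the gender signal. A share well below 1 would point to other cues in the resume.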

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the bias is triggered primarily by names, then any LLM hiring pipeline that retains names will inherit the same directional preference regardless of language.
  • Mitigation may require structural changes to input rather than additional instructions.
  • High refusal rates under anonymization point to a deployment trade-off between bias reduction and system reliability.
  • The same experimental setup could be reused to test whether other demographic signals, such as age or university prestige, produce comparable effects.
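
The trade-off between anonymization and reliability can be made concrete with a toy example. The paper uses a dedicated privacy filter model; the plain string substitution below is a stand-in for illustration, not that method. The placeholder `[CANDIDATE]`, the sample name, and the rule that a refusal is represented as `None` are all assumptions.

```python
def anonymize_name(prompt, name, placeholder="[CANDIDATE]"):
    """Replace every occurrence of the candidate's name with a neutral token."""
    return prompt.replace(name, placeholder)

def refusal_rate(responses, is_refusal):
    """Fraction of responses flagged as refusals by the supplied predicate."""
    if not responses:
        raise ValueError("no responses to evaluate")
    return sum(1 for r in responses if is_refusal(r)) / len(responses)

# Toy data: None marks a refused call (a hypothetical convention).
masked = anonymize_name("Name: Sato Hanako\nRole: analyst", "Sato Hanako")
responses = ["score: 7", None, "score: 6", None, "score: 8"]
rate = refusal_rate(responses, is_refusal=lambda r: r is None)
```

Tracking the refusal rate per model alongside the bias estimate shows whether a mitigation that reduces bias also leaves too few usable ratings to analyse.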

Load-bearing premise

The 12 name pairs and 60 rirekisho resumes vary only in gender signal without other uncontrolled linguistic, cultural, or formatting differences that could affect the ratings.

What would settle it

Re-running the full set of evaluations with all names replaced by gender-neutral placeholders would settle it. If the rating gap between the original male and female groups persists, the claim that the name is the dominant channel is falsified; if the gap disappears, the claim is supported.

Figures

Figures reproduced from arXiv: 2606.18649 by Akshara Nadayanur Sathis Kanna, Gabriele Trovato, Machiko Hirota, Phan Xuan Tan, Rihito Kotani, Serena A. Hoffstedde, Ujwal Kumar.

Figure 1: Gender bias by model and intervention condition (dz, within-subject). Error bars represent 95% confidence intervals. Significance indicators: *** p < .001, ** p < .01, * p < .05, ns = not significant.
Figure 2: Gender bias under baseline and privacy filter conditions for the four analysable models (dz, within-subject). GPT-4o is excluded due to a 42% refusal rate. Error bars represent 95% confidence intervals.
Figure 3: Visual examples of the two resume formats used in this study. (a) Western-style resume template [24]. (b) Japanese rirekisho template [25].
Figure 4: Linear mixed-effects model fixed effects (Holm-corrected). Error bars represent 95% confidence intervals. Positive values indicate female preference.
Figure 5: Score distributions by model and candidate gender (baseline condition). Note that absolute score levels are not comparable across models.
Figure 6: Gender bias (dz) by score dimension and model (baseline condition). Positive values indicate female preference.
Figure 7: Gender bias (dz) by role type (baseline condition). Blue bars indicate male-dominated roles, orange bars female-dominated roles. Error bars represent 95% confidence intervals.
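
Several captions report bias as dz, the within-subject effect size. A minimal sketch of that quantity follows, assuming each resume is scored once under a female name and once under a male name. The function name, the toy scores, and the large-sample confidence interval are assumptions; the paper may compute its intervals differently.

```python
import math
import statistics

def cohens_dz(female_scores, male_scores):
    """Within-subject effect size: mean paired difference / SD of differences."""
    diffs = [f - m for f, m in zip(female_scores, male_scores)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample SD; raises if fewer than 2 pairs
    if sd == 0:
        raise ValueError("all paired differences are identical; dz is undefined")
    dz = statistics.mean(diffs) / sd
    # Large-sample approximation to the standard error of dz (an assumption;
    # a bootstrap or t-based interval could be used instead).
    se = math.sqrt(1 / n + dz ** 2 / (2 * n))
    return dz, (dz - 1.96 * se, dz + 1.96 * se)

# Toy scores for the same four resumes under a female and a male name.
dz, ci = cohens_dz([7, 8, 6, 9], [6, 7, 6, 7])
```

With female scores listed first, a positive dz indicates female preference, matching the sign convention in the captions.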
Original abstract

Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a large-scale empirical study (43,200 API calls) of gender bias in LLM hiring evaluations of Japanese rirekisho resumes. Using 60 resumes and 12 name pairs chosen for linguistically grounded gender signals, it applies a crossed random-effects linear mixed model to five LLMs and finds consistent pro-female bias. A gender-neutrality prompt instruction fails to reduce the bias, while removing candidate names from the prompt nearly eliminates the female effect; an incompatibility between a privacy filter and GPT-4o's safety filter is also documented.

Significance. If the central attribution to gender holds, the work usefully extends Western-centric findings on LLM hiring bias to a Japanese corporate context and supplies a concrete mechanism test via the name-reliance contrast. The experiment scale and use of a crossed random-effects model are strengths that support replicability claims.

major comments (2)
  1. [Methods (name selection)] Methods, name-pair selection: the linguistically grounded gender-signal criteria for the 12 pairs are not shown to be orthogonal to birth-cohort, prefecture, or socioeconomic covariates known to correlate with Japanese names; any such correlation would be absorbed into the gender fixed effect of the crossed random-effects model and would also invalidate the name-removal contrast.
  2. [Methods (resume design)] Methods, resume construction: the description does not confirm that resume content, formatting, and other non-name elements were fully balanced or counterbalanced across the name conditions; residual imbalances would undermine the claim that observed rating differences are attributable solely to the gender signal carried by the name.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'reduces the female effect by nearly its full magnitude' should be accompanied by the precise coefficient change and its standard error from the relevant model.
  2. [Results (model specification)] The manuscript would benefit from an explicit statement of the exact random-effects structure (which factors are crossed, which are nested) in the linear mixed model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below, indicating revisions that will be incorporated in the next version.

  1. Referee: [Methods (name selection)] Methods, name-pair selection: the linguistically grounded gender-signal criteria for the 12 pairs are not shown to be orthogonal to birth-cohort, prefecture, or socioeconomic covariates known to correlate with Japanese names; any such correlation would be absorbed into the gender fixed effect of the crossed random-effects model and would also invalidate the name-removal contrast.

    Authors: We appreciate the referee's point on potential confounding in name selection. Our 12 name pairs were selected using linguistically grounded criteria focused on gender-distinctive kanji characters and phonetic patterns common in Japanese naming conventions. However, we did not include explicit checks for correlations with birth-cohort, prefecture, or socioeconomic status in the original submission. In the revised manuscript, we will add supplementary analyses drawing on publicly available Japanese name frequency databases to quantify any such correlations. Should meaningful correlations emerge, we will report them and discuss implications for the gender fixed effect and name-removal contrast; if correlations prove negligible, this will reinforce the validity of our attribution. This addition directly addresses the concern while preserving the study's core design. revision: yes

  2. Referee: [Methods (resume design)] Methods, resume construction: the description does not confirm that resume content, formatting, and other non-name elements were fully balanced or counterbalanced across the name conditions; residual imbalances would undermine the claim that observed rating differences are attributable solely to the gender signal carried by the name.

    Authors: We agree that explicit documentation of balance across conditions is necessary. The study used a counterfactual design in which each of the 60 rirekisho templates was paired with both male and female names from the 12 pairs, keeping all non-name content (education history, employment records, skills, formatting, and layout) identical. The original methods section omitted a full description of this procedure. We will revise the methods to explicitly state that resume templates were fixed with only the name field varied, confirming full counterbalancing of non-name elements by design. This clarification will strengthen the claim that differences arise from the gender signal in the names. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model fit to observed LLM ratings

The paper's central results come from fitting a crossed random-effects linear mixed model to 43,200 LLM-generated resume ratings collected under explicit experimental conditions (baseline, instruction, name-removed). The pro-female bias coefficient and the name-reliance contrast (near-total reduction when name is removed) are direct statistical outputs of that model applied to the observed data; they do not reduce by construction to any quantity defined from the input prompts or name-selection criteria. No equations, self-citations, or ansatzes are used to derive the key claims; the design is a standard counterfactual experiment whose validity rests on the empirical isolation of the name signal rather than on any definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of linear mixed models (normality of residuals, independence of random effects) and on the untested premise that the chosen name pairs and resume set vary only in gender signal.

axioms (1)
  • domain assumption: Linear mixed model assumptions hold for the rating data (normally distributed residuals, appropriate random-effects structure).
    Invoked when the crossed random-effects model is used to confirm the pro-female bias.
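
For concreteness, one plausible form of a crossed random-effects model for a single LLM and condition is written below. The paper's exact specification is not reproduced in this summary, so every symbol is an assumption: $r$ indexes resumes, $p$ indexes name pairs, $g$ indexes the gender version, and the random intercepts for resume and name pair are crossed rather than nested.

```latex
y_{rpg} = \beta_0 + \beta_1\,\mathrm{female}_{g} + u_r + v_p + \varepsilon_{rpg},
\qquad
u_r \sim \mathcal{N}(0, \sigma_u^2),\quad
v_p \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{rpg} \sim \mathcal{N}(0, \sigma^2)
```

In this sketch $\beta_1 > 0$ corresponds to a pro-female rating advantage. The normality conditions on $u_r$, $v_p$, and $\varepsilon_{rpg}$ are the assumptions the axiom above refers to.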

Reference graph

Works this paper leans on

25 extracted references · 3 linked inside Pith

  [1] OpenAI: GPT-4o System Card (2024). https://cdn.openai.com/gpt-4o-system-card.pdf
  [2] Anthropic: Claude Sonnet 4.6 System Card (2026). https://anthropic.com/claude-sonnet-4-6-system-card
  [3] DeepSeek-AI: DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437 (2024)
  [4] Google DeepMind: Gemini 2.5 Flash Model Card (2025). https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf
  [5] Grattafiori, A., et al.: The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024)
  [6] Karvonen, A., Marks, S.: Robustly Improving LLM Fairness in Realistic Settings via Interpretability. arXiv preprint arXiv:2506.10922 (2025)
  [7] Barešová, I., Nakaya, N., Matlach, V.: Gender-specific features in contemporary Japanese names. Digital Scholarship in the Humanities 39(2), 467–484 (2024)
  [8] Bertrand, M., Mullainathan, S.: Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review 94(4), 991–1013 (2004)
  [9] Barešová, I., Nakaya, N., Matlach, V.: Japanese Names Dataset. GitHub repository (2024). https://github.com/oltkkol/japnames
  [10] Gaebler, J.D., Goel, S., Huq, A., Tambe, P.: Auditing large language models for race and gender disparities: Implications for artificial intelligence-based hiring. Socius 10(2) (2025)
  [11] Gao, B., Kreiss, E.: Measuring bias or measuring the task: Understanding the brittle nature of LLM gender biases. In: Proceedings of EMNLP 2025 (2025)
  [12] Rao, P.S.B., Nagarajan Venkatesan, L., Cherubini, M., Jayagopi, D.B.: Invisible Filters: Cultural Bias in Hiring Evaluations Using Large Language Models. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, vol. 8(3), pp. 2164–2176 (2025). https://ojs.aaai.org/index.php/AIES/article/view/36703
  [13] Ministry of Health, Labour and Welfare, Japan: Standard Rirekisho Template, April 2021. https://www.mhlw.go.jp/content/11601000/000769679.pdf
  [14] Nippon.com: One in three Japanese job seekers faces gender discrimination: Rengo survey (2023). https://www.nippon.com/en/japan-data/h01698/
  [15] OpenAI: Privacy Filter. Hugging Face Model Repository (2026). https://huggingface.co/openai/privacy-filter. Model card: https://cdn.openai.com/pdf/c66281ed-b638-456a-8ce1-97e9f5264a90/OpenAI-Privacy-Filter-Model-Card.pdf
  [16] Chen, B., Tan, Z., Khoo, S., Doan, B.N., Liu, Z., Chen, N.F., Lee, R.K.W.: Small Changes, Big Impact: Demographic Bias in LLM-Based Hiring Through Subtle Sociocultural Markers in Anonymised Resumes. arXiv preprint arXiv:2603.05189 (2026)
  [17] Rozado, D.: Gender and positional biases in LLM-based hiring decisions. PeerJ Computer Science 12, e3628 (2026)
  [18] Sivakaminathan, P., Musi, E.: ChatGPT is a gender bias echo-chamber in HR recruitment: an NLP analysis and framework to uncover the language roots of bias. AI & Society 41, 2841–2861 (2026). https://doi.org/10.1007/s00146-025-02564-8
  [19] Teikoku Databank (2025). https://www.tdb.co.jp/report/economic/20250822-women2025/
  [20] Wang, Z., et al.: JobFair: A framework for benchmarking gender hiring bias in large language models. In: Findings of ACL: EMNLP 2024, pp. 3227–3246 (2024)
  [21] Yamaguchi, K.: Japan’s gender gap. Finance & Development 56(1), 26–29 (2019)
  [22] Yoshida, H.: Estimation of the Sato Surname Population. Think Name Project, Tohoku University (2024). https://think-name.jp/assets/pdf/Sato_estimation_yoshida_hiroshi.pdf
  [23] Hida, R., Kaneko, M., Okazaki, N.: Social Bias Evaluation for Large Language Models Requires Prompt Variations. In: Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 14507–14530 (2025). https://doi.org/10.18653/v1/2025.findings-emnlp.783
  [24] Harvard College Office of Career Services: Resume and Cover Letter Templates. Harvard University (2024). https://ocs.fas.harvard.edu/resumes-cover-letters
  [25] Yokohama National University Student Support Office: Rirekisho Template. https://www.gakuseisupport.ynu.ac.jp/sp/career/student/rireki/