Anchoring LLM Gender Bias to Human Baselines: A Cross-Lingual Audit

Jiwoo Choi; Seohyon Jung; Seonwoo Ahn; Tongxin Zhang

arxiv: 2605.30804 · v1 · pith:4RAX2GI5new · submitted 2026-05-29 · 💻 cs.CL

Pith reviewed 2026-06-28 23:07 UTC · model grok-4.3

keywords: gender stereotyping · large language models · cross-lingual audit · HEXACO personality inventory · human baselines · bias patterns · translation effects

The pith

LLM gender stereotyping on HEXACO-100 spans a range 2.5 times wider than the full variation across 48 human countries and compounds across languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies the HEXACO-100 personality inventory to six LLMs in English, Korean, Chinese, and Japanese, then compares the resulting gender attributions directly to human response distributions from 48 countries. It reports that model stereotyping covers a spread roughly 2.5 times the entire human cross-country range, with one English-centric model reaching five times the local human baseline even in Korean. Translation does not merely rescale the strength of stereotypes but rearranges which specific traits receive gendered attributions. A four-pattern framework classifies the observed behaviors as concordance, suppression, reorganization, or amplification. The results indicate that no single debiasing approach is likely to produce uniform outcomes across linguistic boundaries.

Core claim

By anchoring LLM responses on the HEXACO-100 personality inventory to human data from 48 countries, the study shows that gender stereotyping in large language models spans a range 2.5 times wider than human cross-country differences, with the discrepancy able to compound when models are prompted in non-primary languages; item-level analysis further reveals that translation rearranges the content of stereotypes rather than simply scaling their intensity.
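The "2.5 times wider" comparison can be read as a ratio of two spreads. The following is a minimal sketch of that arithmetic, assuming the spread is taken as max minus min over Cohen's d values; the helper names `spread` and `range_ratio` and all numbers are placeholders invented for illustration, chosen so the ratio comes out at 2.5, and are not values from the paper.

```python
def spread(values):
    """Range (max minus min) of a collection of effect sizes."""
    values = list(values)
    return max(values) - min(values)


def range_ratio(llm_d, human_d):
    """How many times wider the LLM spread is than the human spread."""
    human_spread = spread(human_d)
    if human_spread == 0:
        raise ValueError("human baseline has zero spread")
    return spread(llm_d) / human_spread


# Placeholder numbers for illustration only; NOT values from the paper.
human_d = [0.50, 0.75, 1.00]        # one Cohen's d per country (hypothetical)
llm_d = [0.25, 0.75, 1.00, 1.50]    # one Cohen's d per (model x language) cell (hypothetical)

print(range_ratio(llm_d, human_d))
```

A max-minus-min spread is sensitive to a single extreme cell, so a percentile-based spread may be a reasonable alternative if the paper's definition differs.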

What carries the argument

The HEXACO-100 personality inventory applied to LLMs and compared to human baselines from 48 countries, together with the four-pattern framework of concordance, suppression, reorganization, and amplification.

Load-bearing premise

The HEXACO-100 provides a valid and comparable measure of gender stereotyping when given to LLMs as when given to human respondents across cultures and languages.

What would settle it

Demonstration that the spread of gender attributions produced by the tested LLMs on HEXACO-100 items falls inside the observed human cross-country range in every language examined.

Figures

Figures reproduced from arXiv: 2605.30804 by Jiwoo Choi, Seohyon Jung, Seonwoo Ahn, Tongxin Zhang.

**Figure 1.** Emotionality Cohen’s d across 24 (model × language) cells compared to cross-cultural human baselines from Lee and Ashton (2020). Error bars indicate 95% bootstrap confidence intervals. We calculate a bootstrap 95% CI with B = 2000 resamples per cell. For human baselines, we take the cohort-level Cohen’s d values reported for regions of Japan and South Korea in…

**Figure 3.** Facet-level decomposition of Emotionality for Syn…

**Figure 4.** Facet-level Emotionality breakdown for Hy…

**Figure 5.** Spearman’s rank correlation (ρ) matrices of item-level Female−Male keyed differences (n = 16 Emotionality items) across prompt languages for each model. ∗ indicates p < 0.05. The preceding sections compared aggregate factor- and facet-level stereotype magnitudes across (model × language) cells. A related question is whether translation changes which items drive the gender gap, holding the model constant.…
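The Figure 1 caption reports Cohen's d with a 95% bootstrap confidence interval from B = 2000 resamples per cell. The sketch below shows one common way to compute such an interval. It assumes a pooled-standard-deviation Cohen's d and a percentile bootstrap that resamples within each gender group; the paper's exact estimator is not given on this page, and the function names and synthetic ratings are hypothetical.

```python
import random
from statistics import mean, variance


def cohens_d(female, male):
    """Cohen's d (female minus male) using the pooled standard deviation."""
    n_f, n_m = len(female), len(male)
    pooled_var = ((n_f - 1) * variance(female) + (n_m - 1) * variance(male)) / (n_f + n_m - 2)
    return (mean(female) - mean(male)) / pooled_var ** 0.5


def bootstrap_ci(female, male, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Cohen's d, resampling within each group."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        f = rng.choices(female, k=len(female))  # sample with replacement
        m = rng.choices(male, k=len(male))
        try:
            estimates.append(cohens_d(f, m))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with zero pooled variance
    estimates.sort()
    low = estimates[int((alpha / 2) * len(estimates))]
    high = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return low, high


# Synthetic ratings generated here for illustration only (not data from the paper).
rng = random.Random(42)
female = [rng.gauss(3.6, 0.6) for _ in range(40)]
male = [rng.gauss(3.0, 0.6) for _ in range(40)]

d = cohens_d(female, male)
low, high = bootstrap_ci(female, male)
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The fixed seed only makes the sketch reproducible; a bias-corrected interval would be a reasonable substitute for the percentile method.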

Original abstract

We audit six large language models (LLMs) for gender stereotyping across English, Korean, Chinese, and Japanese. Three were developed primarily for English-language use (Claude, GPT, Gemini) and three for East Asian use (DeepSeek, Syn-Pro, HyperCLOVA X). We adopt the HEXACO-100 personality inventory and anchor each model against a cross-cultural human dataset spanning 48 countries to ask not whether LLMs are biased, but how far their gender attributions drift from the populations they are deployed among. Our findings show that their stereotyping spans a range roughly 2.5 times wider than the entire cross-country range found in humans, and the effect can compound across languages. One English-centric model, prompted in Korean, reached 5 times the local baseline, even when the prompt stated the candidate had already been hired, which often dampens human stereotyping. To characterize such behaviors without ranking them, we introduce a four-pattern framework -- concordance, suppression, reorganization, and amplification -- across 24 (model x language) cells. Item-level analysis reveals that translation does not just rescale stereotypes, but changes the attributes tied to it, hiding significant rearrangement under the surface while appearing well-calibrated. Our results ultimately suggest that no single debiasing pipeline is likely to address bias evenly across linguistic boundaries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper anchors LLM gender attributions to a 48-country human HEXACO dataset and reports a 2.5x wider range with a four-pattern descriptive frame, but the direct comparability of model responses to human trait data is not established.

The core contribution is the attempt to quantify how far six LLMs drift from human population baselines on gender-linked traits when prompted in English, Korean, Chinese, and Japanese. The authors report that the models' stereotyping range is roughly 2.5 times the full cross-country human range, with one English-centric model hitting 5 times the local baseline even under a hired-candidate prompt, and they organize the deviations into concordance, suppression, reorganization, and amplification.

This moves past simple bias detection by tying the numbers to real deployment populations and by showing that translation does not merely rescale stereotypes but rearranges which specific attributes get linked to gender. That item-level observation is the clearest new angle.
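The difference between rescaling and rearranging can be made concrete with a rank correlation, the statistic named in the Figure 5 caption. In this minimal sketch, assuming Spearman's ρ is computed as a Pearson correlation on average ranks, a translation that only rescales item-level Female−Male differences leaves ρ at 1, while one that reassigns the same magnitudes to different items lowers it. The item values are synthetic and the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic item-level Female-minus-Male differences (NOT data from the paper).
english = [0.9, 0.1, 0.5, 0.3, 0.7]
rescaled = [2 * d for d in english]       # same item ordering, doubled magnitude
rearranged = [0.3, 0.7, 0.5, 0.9, 0.1]    # same magnitudes, attached to different items

print(spearman_rho(english, rescaled))
print(spearman_rho(english, rearranged))
```

The rearranged profile has the same mean magnitude as the English one, so an aggregate score would look unchanged even though the rank correlation is negative.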

The soft spot is measurement equivalence. The paper treats HEXACO-100 responses from prompted models as directly comparable to human self-reports across cultures. Models generate next-token predictions conditioned on prompt text; they lack the stable trait structures the inventory was designed to measure. Without reported checks for response consistency, prompt sensitivity, or translation artifacts, the 2.5x and 5x figures could partly reflect surface associations or measurement mismatch rather than genuine drift. On the evidence given, this stress-test concern stands.

The work is aimed at researchers and practitioners who need multilingual fairness metrics grounded in population data rather than abstract ideals. It shows clear engagement with cross-lingual issues and prior human datasets.

It deserves peer review so the methods section can be examined on the comparability controls and statistical handling.

Referee Report

2 major / 2 minor

Summary. The paper audits six LLMs (three English-centric, three East Asian) for gender stereotyping via HEXACO-100 prompts in English, Korean, Chinese, and Japanese. It anchors results against a 48-country human dataset and reports that LLM stereotyping ranges are roughly 2.5 times wider than the full human cross-country range, with compounding across languages (one English-centric model in Korean reaching 5x the local baseline even after a 'hired' prompt). It introduces a four-pattern framework (concordance, suppression, reorganization, amplification) across 24 model-language cells and notes that translation rearranges rather than merely rescales the attributes tied to stereotypes.

Significance. If the direct comparability of LLM and human HEXACO-100 responses holds, the results would demonstrate substantial, language-dependent drift in gender attributions that exceeds observed human variation and resists uniform mitigation, with the four-pattern framework offering a structured, non-ranking lens for cross-lingual analysis.

major comments (2)

[Abstract / Methods] The 2.5x and 5x claims (abstract) rest on treating prompted LLM responses to the HEXACO-100 as a directly comparable metric of gender stereotyping to the 48-country human dataset. The manuscript provides no validation, stability checks, or controls demonstrating that these outputs reflect trait attribution structures rather than surface-level token associations or translation artifacts; this equivalence is load-bearing for interpreting the results as drift magnitude.
[Results / Item-level analysis] The item-level analysis (abstract) asserts that translation 'changes the attributes tied to' stereotypes and produces 'significant rearrangement.' Without reported controls for prompt phrasing, response consistency across repeated queries, or explicit comparison of item loadings between LLM and human data, it is unclear whether observed differences exceed what would be expected from measurement mismatch alone.

minor comments (2)

[Framework] The four-pattern framework is introduced without an explicit decision rule or inter-rater procedure for assigning cells to concordance/suppression/reorganization/amplification; a table or pseudocode definition would improve reproducibility.
[Abstract] The abstract states that the 'hired' prompt 'often dampens human stereotyping' but does not cite the specific human-study reference or report the corresponding LLM effect size for that condition.
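To illustrate what the decision rule requested in the [Framework] comment might look like, here is a purely hypothetical sketch. The paper's actual criteria are not stated on this page; the function name `classify_cell`, the inputs, the thresholds `tolerance` and `min_rho`, and the reading of each pattern are all assumptions made for illustration.

```python
def classify_cell(model_d, human_d, item_rho, tolerance=0.25, min_rho=0.5):
    """Assign one (model x language) cell to a pattern (hypothetical rule).

    model_d   -- the model's gender effect size in this cell
    human_d   -- the local human baseline effect size (must be positive)
    item_rho  -- rank correlation between the cell's item-level profile
                 and a reference profile
    tolerance -- assumed relative band around the human baseline
    min_rho   -- assumed cutoff below which item content counts as rearranged
    """
    if human_d <= 0:
        raise ValueError("this sketch assumes a positive human baseline")
    ratio = model_d / human_d
    if ratio > 1 + tolerance:
        return "amplification"   # magnitude well above the human baseline
    if ratio < 1 - tolerance:
        return "suppression"     # magnitude well below the human baseline
    if item_rho < min_rho:
        return "reorganization"  # similar magnitude, different items driving it
    return "concordance"         # similar magnitude and similar item profile


# Hypothetical cells, not results from the paper.
print(classify_cell(model_d=5.0, human_d=1.0, item_rho=0.9))
print(classify_cell(model_d=1.0, human_d=1.0, item_rho=0.1))
```

Writing the rule as an ordered sequence of checks makes the precedence explicit: in this sketch a cell that is both amplified and rearranged is labeled amplification, a choice the authors would need to confirm or replace.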

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and agree that additional methodological controls will strengthen the manuscript.

Referee: [Abstract / Methods] The 2.5x and 5x claims (abstract) rest on treating prompted LLM responses to the HEXACO-100 as a directly comparable metric of gender stereotyping to the 48-country human dataset. The manuscript provides no validation, stability checks, or controls demonstrating that these outputs reflect trait attribution structures rather than surface-level token associations or translation artifacts; this equivalence is load-bearing for interpreting the results as drift magnitude.

Authors: We agree that the direct comparability assumption is central to interpreting the reported drift magnitudes. The original manuscript relies on the standardized nature of the HEXACO-100 instrument and identical prompting across models and languages but does not include explicit stability checks or controls for token-level associations. In revision we will add repeated-prompt consistency analyses and controls comparing responses to semantically matched but non-personality prompts to demonstrate that the observed patterns exceed surface artifacts.
Revision: yes

Referee: [Results / Item-level analysis] The item-level analysis (abstract) asserts that translation 'changes the attributes tied to' stereotypes and produces 'significant rearrangement.' Without reported controls for prompt phrasing, response consistency across repeated queries, or explicit comparison of item loadings between LLM and human data, it is unclear whether observed differences exceed what would be expected from measurement mismatch alone.

Authors: We concur that stronger controls are needed to isolate genuine rearrangement from measurement mismatch. The current item-level claims rest on direct comparison of HEXACO item responses but lack the requested checks. We will incorporate prompt-variation robustness tests, repeated-query consistency metrics, and factor-loading comparisons between LLM and human data in the revised version to substantiate the reorganization pattern.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical anchoring to external 48-country HEXACO-100 human dataset

The paper measures LLM responses on the HEXACO-100 inventory across languages and directly compares the resulting gender-attribution ranges and patterns to an independent cross-cultural human dataset. No parameters are fitted to the target LLM outputs, no predictions are derived from the same data used for measurement, and the four-pattern framework (concordance, suppression, reorganization, amplification) is introduced as a descriptive taxonomy rather than a deductive result. The central claims rest on external benchmarks, satisfying the self-contained criterion with no load-bearing self-citations or definitional reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central measurement rests on the assumption that a personality inventory developed for humans can be applied directly to LLMs to quantify stereotyping in a cross-culturally meaningful way.

axioms (1)

domain assumption: HEXACO-100 provides a valid and comparable measure of gender stereotyping for both human respondents and LLM outputs across languages.
Used to anchor all model outputs to the 48-country human dataset.

pith-pipeline@v0.9.1-grok · 5776 in / 1223 out tokens · 23525 ms · 2026-06-28T23:07:53.495670+00:00 · methodology

Reference graph

Works this paper leans on

3 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Morten Moshagen, Isabel Thielmann, Benjamin E Hilbig, and Ingo Zettler

Social categorization and discriminatory behavior: Extinguishing the minimal intergroup discrimination effect. Journal of Personality and Social Psychology, 39(5):773. Morten Moshagen, Isabel Thielmann, Benjamin E Hilbig, and Ingo Zettler. 2019. Meta-analytic investigations of the HEXACO personality inventory (-revised). Zeitschrift für Psychologie. ...

2019
[2]

Large language models can replicate cross-cultural differences in personality. Journal of Research in Personality, 115:104584. OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, P...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

HyperCLOVA X technical report. Preprint at https://arxiv.org/abs/2404.01954 (2024)

HyperCLOVA X technical report. Preprint, arXiv:2404.01954. A Prompt Design and Anti-Rote Scale Rotation This appendix documents the exact prompts administered to each model, the anti-rote scale rotation procedure, and the item-order randomization used in each run. All materials were identical across the six models; only the API endpoint differed. A.1 Syste...

work page arXiv 2020
