pith. machine review for the scientific record.

arxiv: 2604.22153 · v1 · submitted 2026-04-24 · 💻 cs.CL · cs.AI · cs.CY

Recognition: unknown

When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 12:04 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY
keywords large language models · cultural bias · individualism-collectivism · AI ethics · value alignment · cross-cultural · World Values Survey

The pith

Frontier AI models deliver Western individualist advice to users worldwide, exceeding local values by 0.76 points on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models give culturally neutral advice by presenting three major AI systems with ten personal dilemmas framed for users in ten countries across five continents. It compares the models' responses on individualism versus collectivism against actual data from the World Values Survey in those countries. The results show a consistent bias toward individualist perspectives that does not match the surveyed preferences of people in most societies. This matters because AI assistants are increasingly used for personal life decisions, potentially spreading a single set of values globally. The study also identifies differences in how models respond to language and country cues.
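To make the design concrete, the condition grid can be reconstructed from the figure captions below. The sketch is illustrative, not the authors' released code: conditions follow Figure 3's definitions (C1 = English, no label, USA only; C2 = native language, no label; C3 = native language + country label; C4 = English + country label), and only four of the ten countries are named in this review, so the other six are hypothetical placeholders.

```python
# Reconstruction of the 840-call study grid from the figure captions.
# Only USA, India, Nigeria, and Japan are named in this review; the six
# "country_*" entries are hypothetical placeholders.

NON_US = ["India", "Nigeria", "Japan"] + [f"country_{i}" for i in range(1, 7)]

# 1 USA cell (C1 only) + 9 countries x 3 conditions (C2-C4) = 28 cells
cells = [("USA", "C1")] + [(c, cond) for c in NON_US for cond in ("C2", "C3", "C4")]
assert len(cells) == 28

MODELS = ["claude-sonnet-4.5", "gpt-5.4", "gemini-2.5-flash"]
DILEMMAS = [f"dilemma_{i}" for i in range(1, 11)]  # ten personal dilemmas

# One API call per (model, dilemma, country, condition), at temperature = 0
calls = [(m, d, c, cond) for m in MODELS for d in DILEMMAS for (c, cond) in cells]
assert len(calls) == 840  # matches the n = 840 scored responses
```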

Core claim

All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity.

What carries the argument

Cross-cultural comparison of AI responses to ten personal dilemmas framed by country and language against World Values Survey individualism-collectivism scores.
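The headline statistic follows directly from that comparison. A minimal sketch, assuming a flat table of scored responses with hypothetical column names (ic_score for the judged 1-5 score of each response, wvs_expected for the country's WVS Wave 7 expectation); it mirrors the comparison described here, not the released pipeline itself:

```python
# Per-response misalignment = judged IC score minus the WVS expectation for
# that country; the headline gap is tested against zero. Columns are assumed.
import pandas as pd
from scipy import stats

df = pd.read_csv("scored_responses.csv")            # hypothetical file layout
df["misalignment"] = df["ic_score"] - df["wvs_expected"]

gap = df["misalignment"].mean()                     # paper reports +0.76
t, p = stats.ttest_1samp(df["misalignment"], 0.0)   # paper: t = 15.65, p < 0.001
print(f"mean gap = {gap:+.2f}, t = {t:.2f}, p = {p:.2g}")

# Per-country gaps (paper: Nigeria +1.85, India +0.82, Japan negative)
print(df.groupby("country")["misalignment"].mean().sort_values(ascending=False))
```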

Load-bearing premise

The ten personal dilemmas and their country-specific framings accurately capture and isolate individualism-collectivism differences without introducing unintended prompt biases or cultural mismatches in the scenario design.

What would settle it

If a replication using alternative dilemmas or additional countries finds AI advice matching local survey values with no significant gap, the claim of consistent Western bias would be falsified.

Figures

Figures reproduced from arXiv: 2604.22153 by Pruthvinath Jeripity Venkata.

Figure 1
Figure 1. Study pipeline overview. Left to right: study design (840 API calls: 10 prompts × 3 models across 10 countries and up to 4 conditions per country), prompt assembly, three frontier LLMs at temperature = 0, dual LLM scoring (IC score primary; DeepSeek-V3 sub-dimensions secondary), and misalignment analysis against WVS Wave 7 ground truth. view at source ↗
Figure 2
Figure 2. Mean WVS misalignment scores by model and country. Red = individualist bias; blue = collectivist bias relative to WVS predictions. Countries ordered left to right by increasing individualism (WVS expected score). n = 840; zero refusals. view at source ↗
Figure 3
Figure 3. Composite IC score across experimental conditions (C1–C4) per model. Colored line = model mean ± 95% CI; shaded band = confidence interval. Grey dots = individual country means (jittered horizontally to reduce overlap). C1 = English, no label (USA only); C2 = native language, no label; C3 = native + country label; C4 = English + country label. A C3–C4 gap indicates language carries cultural signal beyond the explicit country label. view at source ↗
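The C3–C4 contrast in this caption is a simple per-model difference of means. A hedged sketch, reusing the hypothetical table layout from above:

```python
# Per-model C3 vs C4 means; a negative C3 - C4 difference would mean the
# native language pulls the composite score collectivist beyond the label.
import pandas as pd

df = pd.read_csv("scored_responses.csv")            # hypothetical file layout
means = (df[df["condition"].isin(["C3", "C4"])]
         .groupby(["model", "condition"])["ic_score"].mean()
         .unstack("condition"))
means["C3_minus_C4"] = means["C3"] - means["C4"]
print(means)
```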
Figure 4
Figure 4. Prompt-level misalignment by dilemma topic with bootstrap 95% CIs (+ = individualist bias): Women's career after marriage +1.85; Arranged marriage +1.80; Unhappy marriage +1.44; Mental health +1.05; Question doctor +0.65; Report family +0.55; Career vs parents +0.47; Eldest abroad +0.10; Religion vs career −0.04; Challenge manager −0.26. view at source ↗
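The per-dilemma intervals in Figure 4 are bootstrap CIs over response-level misalignment. A minimal percentile-bootstrap sketch under the same assumed table layout; the authors' exact resampling scheme may differ:

```python
# Percentile-bootstrap 95% CIs for mean misalignment per dilemma.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def bootstrap_ci(x, n_boot=10_000, alpha=0.05):
    """Resample response-level scores with replacement; CI for the mean."""
    means = rng.choice(x, size=(n_boot, len(x)), replace=True).mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

df = pd.read_csv("scored_responses.csv")            # hypothetical file layout
df["misalignment"] = df["ic_score"] - df["wvs_expected"]
for dilemma, grp in df.groupby("dilemma"):
    lo, hi = bootstrap_ci(grp["misalignment"].to_numpy())
    print(f"{dilemma}: {grp['misalignment'].mean():+.2f} [{lo:+.2f}, {hi:+.2f}]")
```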
Figure 5
Figure 5. Mean sub-dimension scores by dilemma prompt (DeepSeek-V3 judge, n = 840). Rows = sub-dimensions; columns = prompts ordered left to right by mean IC misalignment. Red = above neutral (> 3.0), blue = below neutral (< 3.0), white = neutral. Autonomy is near ceiling across all prompts; Family Orientation is below neutral for religion and marriage scenarios. view at source ↗
Figure 6
Figure 6. Per-model mean sub-dimension scores (DeepSeek-V3 judge, n = 840). Bars show mean score per dimension per model; dashed line = neutral (3.0); t-statistics (vs. 3.0) shown below each dimension label. All three models score well above neutral on IC Score (≈4.0) and Autonomy (≈4.5); Authority Deference is marginally above neutral; Family Orientation is significantly below neutral (≈2.8, t = −7.5 ***). view at source ↗
Figure 7
Figure 7. Forest plot of mixed-effects estimates. Dots = fixed-effect estimates; error bars = ±1 SE; dashed vertical = zero; grey dashed = original t-test intercept (+0.888). Reference category = Claude Sonnet 4.5. Three random-intercept specifications (prompt-only, country-only, prompt×country) yield stable estimates. ICCprompt = 0.27, ICCcountry = 0.19 confirm moderate clustering that does not overturn the main result. view at source ↗
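Figure 7's estimates come from random-intercept mixed models. Below is a sketch of the country-only specification using statsmodels (the prompt-only specification is analogous); column names remain assumptions, with Claude Sonnet 4.5 as the reference level per the caption:

```python
# Country-only random-intercept model for per-response misalignment.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("scored_responses.csv")            # hypothetical file layout
df["misalignment"] = df["ic_score"] - df["wvs_expected"]

md = smf.mixedlm("misalignment ~ C(model, Treatment(reference='claude-sonnet-4.5'))",
                 data=df, groups=df["country"])
fit = md.fit()
print(fit.summary())                                # fixed effects ± 1 SE

# Intraclass correlation from the variance components (caption: ICC_country = 0.19)
var_country = fit.cov_re.iloc[0, 0]
print(f"ICC_country ≈ {var_country / (var_country + fit.scale):.2f}")
```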
read the original abstract

When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT-5.4, and Gemini 2.5 Flash) with ten real-life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper audits three frontier LLMs (Claude Sonnet 4.5, GPT-5.4, Gemini 2.5 Flash) on ten personal dilemmas framed for users from ten countries across five continents and seven languages (n=840 scored responses). It reports that all models produce significantly more individualist advice than World Values Survey Wave 7 country norms predict (mean gap +0.76 on a 1-5 scale, t=15.65, p<0.001), with largest deviations in Nigeria and India, a reversal in Japan, and model-specific sensitivities to language versus stated country identity. The authors conclude this indicates systemic value homogenization and release data, code, and scoring pipeline.

Significance. If the measured gap is robust, the work provides concrete evidence of cultural bias in deployed LLMs and quantifies its magnitude against an external benchmark, with direct implications for AI safety, fairness, and global deployment. The open release of the full dataset, code, and scoring pipeline is a clear strength that enables independent verification and extension.

major comments (2)
  1. [Methods] Methods section: The abstract states that dilemmas were 'framed for users from 10 countries' but supplies no description of dilemma selection criteria, exact adaptation process for each country/language, scoring rubric for individualism-collectivism, or any pilot validation that the framings preserve construct equivalence (e.g., whether family-conflict items load identically on collectivism in Nigeria versus Japan). This information is load-bearing for the central claim that the +0.76 gap reflects model values rather than prompt artifacts.
  2. [Results] Results: The headline t-test and per-country gaps (Nigeria +1.85, India +0.82) rest on the assumption that the 840 scored responses isolate individualism-collectivism; without reported inter-rater reliability, controls for prompt wording, or equivalence checks across the seven languages, it remains possible that translation asymmetries or implicit Western defaults in the framings contribute to the observed difference versus WVS Wave 7.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it stated the exact distribution of responses per model and per country rather than only the aggregate n=840.
  2. [Introduction] Terminology such as 'Western-style, individualist advice' should be explicitly anchored to the specific WVS items or scoring dimensions used, to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify areas where the manuscript can be strengthened with additional methodological transparency and robustness checks. We address each point below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Methods] Methods section: The abstract states that dilemmas were 'framed for users from 10 countries' but supplies no description of dilemma selection criteria, exact adaptation process for each country/language, scoring rubric for individualism-collectivism, or any pilot validation that the framings preserve construct equivalence (e.g., whether family-conflict items load identically on collectivism in Nigeria versus Japan). This information is load-bearing for the central claim that the +0.76 gap reflects model values rather than prompt artifacts.

    Authors: We agree that the current Methods section is insufficiently detailed on these points. In the revised manuscript we will add a dedicated subsection that specifies: (1) the criteria used to select the ten dilemmas (coverage of career, family, autonomy, and authority domains drawn from prior cross-cultural psychology literature); (2) the exact adaptation protocol, including native-speaker translation, back-translation, and cultural localization steps for each of the seven languages; (3) the complete scoring rubric with anchor examples for each point on the 1-5 individualism-collectivism scale; and (4) pilot validation results from an independent sample of 40 responses in which two cultural psychologists assessed construct equivalence via item-level correlations with World Values Survey items and expert ratings of cultural appropriateness. These additions will directly address the concern that the observed gap could be an artifact of prompt construction. revision: yes

  2. Referee: [Results] Results: The headline t-test and per-country gaps (Nigeria +1.85, India +0.82) rest on the assumption that the 840 scored responses isolate individualism-collectivism; without reported inter-rater reliability, controls for prompt wording, or equivalence checks across the seven languages, it remains possible that translation asymmetries or implicit Western defaults in the framings contribute to the observed difference versus WVS Wave 7.

    Authors: We accept that additional statistical safeguards are warranted. The revised Results section will report inter-rater reliability (Cohen’s kappa and percentage agreement) obtained from a second independent scorer on a 20% random subsample of the 840 responses. We will also add (a) a sensitivity analysis that varies prompt wording while holding country and language constant and (b) language-specific equivalence checks (multigroup confirmatory factor analysis on the scored items). While we continue to view the core finding as robust—given its consistency across three models, ten countries, and the fact that the largest gaps appear in non-Western contexts—these controls will be included to rule out translation or framing confounds. The full dataset and scoring code are already public, enabling external verification. revision: yes
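For reference, the promised reliability check is straightforward once a second rater's scores exist. A sketch, assuming hypothetical columns ic_score and second_rater_score and using quadratic-weighted kappa for the ordinal 1-5 rubric; the authors may report the unweighted variant:

```python
# Inter-rater reliability on a 20% random subsample, as promised above.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("scored_responses.csv")            # hypothetical file layout
sub = df.sample(frac=0.20, random_state=0)          # ~168 of the 840 responses

kappa = cohen_kappa_score(sub["ic_score"], sub["second_rater_score"],
                          weights="quadratic")      # ordinal-aware agreement
agreement = (sub["ic_score"] == sub["second_rater_score"]).mean()
print(f"weighted kappa = {kappa:.2f}, exact agreement = {agreement:.1%}")
```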

Circularity Check

0 steps flagged

No significant circularity; central claim rests on independent external benchmark

full rationale

The paper's derivation consists of prompting three LLMs with ten dilemmas framed by country, scoring the 840 responses for individualism-collectivism, and computing the mean gap against World Values Survey Wave 7 country-level data. This gap (+0.76, t=15.65) is obtained by direct comparison to an external, pre-existing survey rather than by fitting parameters to the LLM outputs themselves or by any self-referential definition. No equations, self-citations, ansatzes, or uniqueness theorems are invoked that would reduce the result to the inputs by construction. The design is therefore self-contained against an independent benchmark and falsifiable outside the paper's own data collection.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the premise that the World Values Survey provides a valid external ground truth for cultural values and that the chosen dilemmas plus country framing isolate the individualism-collectivism dimension without confounding factors.

axioms (1)
  • domain assumption World Values Survey Wave 7 accurately measures country-level individualism-collectivism preferences that serve as the appropriate benchmark for AI advice.
    The paper directly compares AI scores to these survey values as the reference for 'local values would predict'.

pith-pipeline@v0.9.0 · 5593 in / 1426 out tokens · 108645 ms · 2026-05-08T12:04:18.418814+00:00 · methodology

discussion (0)

