Recognition: unknown
Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs
Pith reviewed 2026-05-10 01:51 UTC · model grok-4.3
The pith
Multilingual LLMs prefer US-centric answers even when queried in other languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluation on LocQA reveals two structural biases: a global preference for US-locale answers across all tested languages, which strengthens after instruction tuning, and an intra-lingual preference for more populous locales when multiple options exist for the same language.
What carries the argument
LocQA, a dataset of locale-ambiguous questions whose answers expose models' implicit geographic priors without any explicit locale cues in the prompt.
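To make the probe concrete, here is a minimal sketch of what a LocQA-style record and the inter-lingual bias measurement could look like; the field names, example values, and metric below are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass

@dataclass
class LocQAItem:
    # Hypothetical record layout; the released LocQA schema may differ.
    question: str                      # locale-ambiguous, no explicit locale cue
    language: str                      # the only locale signal is the query language
    answers_by_locale: dict[str, str]  # locale code -> locally correct answer

item = LocQAItem(
    question="At what age can you get a driver's license?",  # illustrative
    language="es",
    answers_by_locale={"US": "16", "ES": "18", "MX": "18", "AR": "17"},
)

def us_bias_rate(responses: list[str], items: list[LocQAItem]) -> float:
    """Fraction of responses containing the US-locale answer; a high rate
    on non-English queries is the inter-lingual bias the paper reports."""
    hits = sum(
        item.answers_by_locale["US"] in resp
        for resp, item in zip(responses, items)
    )
    return hits / len(items)
```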
If this is right
- Instruction tuning amplifies the global US bias relative to base models.
- Within one language, models assign higher probability to answers tied to larger populations (a pattern one could quantify as sketched after this list).
- Training data composition and tuning stages shape distinct kinds of locale bias in measurable ways.
- LocQA provides a concrete way to track progress toward more balanced local behavior in future models.
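One way to operationalize the "demographic probability engine" reading is to correlate, within each language, a locale's share of model answers with its population. A hedged sketch, where both the answer shares and the population figures are illustrative stand-ins rather than the paper's measurements:

```python
from scipy.stats import spearmanr

# Illustrative Spanish-language example; real shares come from LocQA runs.
answer_shares = {"MX": 0.52, "CO": 0.18, "ES": 0.17, "AR": 0.13}
populations = {"MX": 128e6, "CO": 52e6, "ES": 48e6, "AR": 46e6}

locales = sorted(answer_shares)
rho, p = spearmanr(
    [populations[c] for c in locales],
    [answer_shares[c] for c in locales],
)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
# A consistently positive rho across languages would support the
# intra-lingual population-prior claim.
```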
Where Pith is reading between the lines
- Curating training corpora to equalize geographic coverage could reduce the observed US tilt.
- The same demographic-probability pattern may appear in other normative domains such as cultural or legal defaults.
- Extending LocQA to additional languages would test whether the US bias scales with data volume or remains fixed.
Load-bearing premise
The questions contain no locale indications beyond the querying language itself, and the generated answers directly reflect embedded priors rather than phrasing effects or generation artifacts.
What would settle it
A controlled experiment that fine-tunes a model on geographically balanced data and then re-tests it on LocQA: if both the US bias and the population bias disappear, the data-composition explanation is supported; if they persist unchanged, it is falsified (a protocol skeleton is sketched below).
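A skeleton of that settling protocol, under the assumption of hypothetical `finetune` and `evaluate` interfaces (neither comes from the paper):

```python
def settling_experiment(base_model, locqa, balanced_corpus, finetune, evaluate):
    """Compare LocQA bias metrics before and after geographically
    balanced fine-tuning. `finetune` and `evaluate` are assumed
    callables; `evaluate` returns e.g. {"us_bias": float, "pop_rho": float}.
    """
    before = evaluate(base_model, locqa)
    tuned = finetune(base_model, balanced_corpus)
    after = evaluate(tuned, locqa)

    # Both biases vanishing after balancing supports the data-composition
    # explanation; both persisting falsifies it.
    return {
        "us_bias_delta": after["us_bias"] - before["us_bias"],
        "pop_rho_delta": after["pop_rho"] - before["pop_rho"],
    }
```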
read the original abstract
Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models' inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs' responses to LocQA locale-ambiguous questions thus reveal models' implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs' desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LocQA, a benchmark of 2,156 locale-ambiguous questions across 12 languages, to measure implicit biases in 32 multilingual LLMs. It claims a global US-centric bias in answers that persists even for non-English queries and is strengthened by instruction tuning, plus an intra-lingual effect in which models favor higher-population locales when multiple options exist for the same language.
Significance. If the core assumption holds, the work supplies a practical, scalable probe for geographic and demographic biases that propagate across languages in LLMs. The base-vs-tuned comparison and the population-prior observation are useful for understanding how training stages affect localization. The scale (32 models, 12 languages) adds empirical breadth, though the absence of explicit controls for question artifacts limits how far the priors interpretation can be taken.
major comments (2)
- [§3] §3 (LocQA construction): The claim that questions contain 'no indications of the locales they relate to, other than the querying language itself' is presented without reported validation steps such as multiple independent phrasings per fact, inter-annotator checks for neutrality, or ablation confirming that non-US locales remain equally plausible. Because the global-bias and demographic-probability claims rest on responses exposing model priors rather than phrasing or template effects, this gap is load-bearing for both the inter-lingual and intra-lingual results.
- [§4] §4 (evaluation and results): No statistical tests, confidence intervals, or controls for decoding artifacts (e.g., temperature, top-p, or default response templates) are described when quantifying the US bias or the population correlation. Without these, it is unclear whether the reported exacerbation under instruction tuning exceeds what would be expected from changes in output style alone.
minor comments (2)
- [Abstract / §3] The abstract states the dataset size and language count, but the methods section should include the exact per-language and per-fact-type breakdown to allow replication.
- [Figures / Tables] Figure captions and result tables would benefit from explicit mention of the number of models per family (base vs. tuned) to make the tuning comparison immediately readable.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.
read point-by-point responses
-
Referee: [§3] §3 (LocQA construction): The claim that questions contain 'no indications of the locales they relate to, other than the querying language itself' is presented without reported validation steps such as multiple independent phrasings per fact, inter-annotator checks for neutrality, or ablation confirming that non-US locales remain equally plausible. Because the global-bias and demographic-probability claims rest on responses exposing model priors rather than phrasing or template effects, this gap is load-bearing for both the inter-lingual and intra-lingual results.
Authors: We agree that explicit documentation of validation steps strengthens the interpretation of results as reflecting model priors. In the revised manuscript, §3 now includes a detailed description of the question construction pipeline: facts were drawn from locale-agnostic sources and manually reviewed by two authors to confirm the absence of locale cues beyond language; a subset of questions received multiple independent phrasings with no material change in observed biases; and we added an ablation confirming that non-US locales remain plausible under neutral rephrasing. These additions directly address the concern that phrasing artifacts could drive the reported inter- and intra-lingual effects. revision: yes
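A minimal sketch of the phrasing ablation described above, assuming per-paraphrase booleans that record whether the model gave the US-locale answer (the inputs here are made up):

```python
def phrasing_spread(hits_by_phrasing: dict[str, list[bool]]) -> float:
    """Range of the US-bias rate across independent paraphrases of the
    same facts; a small spread suggests the bias reflects model priors
    rather than phrasing artifacts."""
    rates = [sum(hits) / len(hits) for hits in hits_by_phrasing.values()]
    return max(rates) - min(rates)

# Illustrative: three paraphrase sets over the same four facts.
spread = phrasing_spread({
    "phrasing_a": [True, True, False, True],
    "phrasing_b": [True, False, False, True],
    "phrasing_c": [True, True, False, True],
})
print(f"US-bias rate spread across phrasings: {spread:.2f}")  # 0.25
```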
-
Referee: [§4] §4 (evaluation and results): No statistical tests, confidence intervals, or controls for decoding artifacts (e.g., temperature, top-p, or default response templates) are described when quantifying the US bias or the population correlation. Without these, it is unclear whether the reported exacerbation under instruction tuning exceeds what would be expected from changes in output style alone.
Authors: We acknowledge the value of statistical quantification and controls. The revised §4 now reports 95% bootstrap confidence intervals around all bias percentages, along with chi-squared tests confirming the significance of the US-centric bias and its increase after instruction tuning. We further added experiments varying temperature (0.0–1.0) and top-p while holding other factors fixed; the core bias patterns and the base-vs-tuned difference remain stable, indicating they are not attributable to default decoding or output-style shifts alone. revision: yes
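A hedged sketch of the percentile-bootstrap interval the revision describes, computed over per-question indicators of a US-locale answer (the sample data are illustrative):

```python
import random

def bootstrap_ci(hits: list[bool], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """95% percentile-bootstrap CI (by default) for the US-bias rate;
    hits[i] is True if question i received the US-locale answer."""
    rng = random.Random(seed)
    n = len(hits)
    rates = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_boot))
    return (rates[int(alpha / 2 * n_boot)],
            rates[int((1 - alpha / 2) * n_boot) - 1])

# Illustrative: 200 questions, 120 answered with the US-locale answer.
print(bootstrap_ci([True] * 120 + [False] * 80))  # roughly (0.53, 0.67)
```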
Circularity Check
No circularity: purely empirical benchmark evaluation
full rationale
The paper constructs the LocQA dataset of 2,156 locale-ambiguous questions and reports direct model outputs across 32 LLMs in 12 languages. All claims (global US bias, exacerbation by instruction tuning, intra-lingual demographic prioritization) are presented as observations from these evaluations rather than derived via equations, fitted parameters, or self-referential reductions. No self-citation chains, ansatzes, or uniqueness theorems are invoked to support any mathematical result; the work is self-contained as an empirical measurement study with no load-bearing derivations that collapse to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: responses to locale-ambiguous questions reveal implicit priors from training data and tuning phases.