Recognition: unknown
Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs
Pith reviewed 2026-05-10 01:51 UTC · model grok-4.3
The pith
Multilingual LLMs prefer US-centric answers even when queried in other languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluation on LocQA reveals two structural biases: a global preference for US-locale answers across all tested languages, which strengthens after instruction tuning, and an intra-lingual preference for more populous locales when multiple options exist for the same language.
What carries the argument
LocQA, a dataset of locale-ambiguous questions whose answers expose models' implicit geographic priors without any explicit locale cues in the prompt.
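To make the probe concrete, here is a minimal sketch of what a LocQA-style record and the inter-lingual bias measurement could look like; the field names, example values, and metric below are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass

@dataclass
class LocQAItem:
    # Hypothetical record layout; the released LocQA schema may differ.
    question: str                      # locale-ambiguous, no explicit locale cue
    language: str                      # the only locale signal is the query language
    answers_by_locale: dict[str, str]  # locale code -> locally correct answer

item = LocQAItem(
    question="At what age can you get a driver's license?",  # illustrative
    language="es",
    answers_by_locale={"US": "16", "ES": "18", "MX": "18", "AR": "17"},
)

def us_bias_rate(responses: list[str], items: list[LocQAItem]) -> float:
    """Fraction of responses containing the US-locale answer; a high rate
    on non-English queries is the inter-lingual bias the paper reports."""
    hits = sum(
        item.answers_by_locale["US"] in resp
        for resp, item in zip(responses, items)
    )
    return hits / len(items)
```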
If this is right
- Instruction tuning amplifies the global US bias relative to base models.
- Within one language, models assign higher probability to answers tied to larger populations (a pattern one could quantify as sketched after this list).
- Training data composition and tuning stages shape distinct kinds of locale bias in measurable ways.
- LocQA provides a concrete way to track progress toward more balanced local behavior in future models.
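One way to operationalize the "demographic probability engine" reading is to correlate, within each language, a locale's share of model answers with its population. A hedged sketch, where both the answer shares and the population figures are illustrative stand-ins rather than the paper's measurements:

```python
from scipy.stats import spearmanr

# Illustrative Spanish-language example; real shares come from LocQA runs.
answer_shares = {"MX": 0.52, "CO": 0.18, "ES": 0.17, "AR": 0.13}
populations = {"MX": 128e6, "CO": 52e6, "ES": 48e6, "AR": 46e6}

locales = sorted(answer_shares)
rho, p = spearmanr(
    [populations[c] for c in locales],
    [answer_shares[c] for c in locales],
)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
# A consistently positive rho across languages would support the
# intra-lingual population-prior claim.
```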
Where Pith is reading between the lines
- Curating training corpora to equalize geographic coverage could reduce the observed US tilt.
- The same demographic-probability pattern may appear in other normative domains such as cultural or legal defaults.
- Extending LocQA to additional languages would test whether the US bias scales with data volume or remains fixed.
Load-bearing premise
The questions contain no locale indications beyond the querying language itself, and the generated answers directly reflect embedded priors rather than phrasing effects or generation artifacts.
What would settle it
A controlled experiment that fine-tunes a model on geographically balanced data and then re-tests it on LocQA: if both the US bias and the population bias disappear, the data-composition explanation is supported; if they persist unchanged, it is falsified (a protocol skeleton is sketched below).
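A skeleton of that settling protocol, under the assumption of hypothetical `finetune` and `evaluate` interfaces (neither comes from the paper):

```python
def settling_experiment(base_model, locqa, balanced_corpus, finetune, evaluate):
    """Compare LocQA bias metrics before and after geographically
    balanced fine-tuning. `finetune` and `evaluate` are assumed
    callables; `evaluate` returns e.g. {"us_bias": float, "pop_rho": float}.
    """
    before = evaluate(base_model, locqa)
    tuned = finetune(base_model, balanced_corpus)
    after = evaluate(tuned, locqa)

    # Both biases vanishing after balancing supports the data-composition
    # explanation; both persisting falsifies it.
    return {
        "us_bias_delta": after["us_bias"] - before["us_bias"],
        "pop_rho_delta": after["pop_rho"] - before["pop_rho"],
    }
```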
read the original abstract
Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models' inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs' responses to LocQA locale-ambiguous questions thus reveal models' implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs' desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LocQA, a benchmark of 2,156 locale-ambiguous questions across 12 languages, to measure implicit biases in 32 multilingual LLMs. It claims a global US-centric bias in answers that persists even for non-English queries and is strengthened by instruction tuning, plus an intra-lingual effect in which models favor higher-population locales when multiple options exist for the same language.
Significance. If the core assumption holds, the work supplies a practical, scalable probe for geographic and demographic biases that propagate across languages in LLMs. The base-vs-tuned comparison and the population-prior observation are useful for understanding how training stages affect localization. The scale (32 models, 12 languages) adds empirical breadth, though the absence of explicit controls for question artifacts limits how far the priors interpretation can be taken.
major comments (2)
- [§3] §3 (LocQA construction): The claim that questions contain 'no indications of the locales they relate to, other than the querying language itself' is presented without reported validation steps such as multiple independent phrasings per fact, inter-annotator checks for neutrality, or ablation confirming that non-US locales remain equally plausible. Because the global-bias and demographic-probability claims rest on responses exposing model priors rather than phrasing or template effects, this gap is load-bearing for both the inter-lingual and intra-lingual results.
- [§4] §4 (evaluation and results): No statistical tests, confidence intervals, or controls for decoding artifacts (e.g., temperature, top-p, or default response templates) are described when quantifying the US bias or the population correlation. Without these, it is unclear whether the reported exacerbation under instruction tuning exceeds what would be expected from changes in output style alone.
minor comments (2)
- [Abstract / §3] The abstract states the dataset size and language count, but the methods section should include the exact per-language and per-fact-type breakdown to allow replication.
- [Figures / Tables] Figure captions and result tables would benefit from explicit mention of the number of models per family (base vs. tuned) to make the tuning comparison immediately readable.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.
read point-by-point responses
-
Referee: [§3] §3 (LocQA construction): The claim that questions contain 'no indications of the locales they relate to, other than the querying language itself' is presented without reported validation steps such as multiple independent phrasings per fact, inter-annotator checks for neutrality, or ablation confirming that non-US locales remain equally plausible. Because the global-bias and demographic-probability claims rest on responses exposing model priors rather than phrasing or template effects, this gap is load-bearing for both the inter-lingual and intra-lingual results.
Authors: We agree that explicit documentation of validation steps strengthens the interpretation of results as reflecting model priors. In the revised manuscript, §3 now includes a detailed description of the question construction pipeline: facts were drawn from locale-agnostic sources and manually reviewed by two authors to confirm the absence of locale cues beyond language; a subset of questions received multiple independent phrasings with no material change in observed biases; and we added an ablation confirming that non-US locales remain plausible under neutral rephrasing. These additions directly address the concern that phrasing artifacts could drive the reported inter- and intra-lingual effects. revision: yes
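A minimal sketch of the phrasing ablation described above, assuming per-paraphrase booleans that record whether the model gave the US-locale answer (the inputs here are made up):

```python
def phrasing_spread(hits_by_phrasing: dict[str, list[bool]]) -> float:
    """Range of the US-bias rate across independent paraphrases of the
    same facts; a small spread suggests the bias reflects model priors
    rather than phrasing artifacts."""
    rates = [sum(hits) / len(hits) for hits in hits_by_phrasing.values()]
    return max(rates) - min(rates)

# Illustrative: three paraphrase sets over the same four facts.
spread = phrasing_spread({
    "phrasing_a": [True, True, False, True],
    "phrasing_b": [True, False, False, True],
    "phrasing_c": [True, True, False, True],
})
print(f"US-bias rate spread across phrasings: {spread:.2f}")  # 0.25
```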
-
Referee: [§4] §4 (evaluation and results): No statistical tests, confidence intervals, or controls for decoding artifacts (e.g., temperature, top-p, or default response templates) are described when quantifying the US bias or the population correlation. Without these, it is unclear whether the reported exacerbation under instruction tuning exceeds what would be expected from changes in output style alone.
Authors: We acknowledge the value of statistical quantification and controls. The revised §4 now reports 95% bootstrap confidence intervals around all bias percentages, along with chi-squared tests confirming the significance of the US-centric bias and its increase after instruction tuning. We further added experiments varying temperature (0.0–1.0) and top-p while holding other factors fixed; the core bias patterns and the base-vs-tuned difference remain stable, indicating they are not attributable to default decoding or output-style shifts alone. revision: yes
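A hedged sketch of the percentile-bootstrap interval the revision describes, computed over per-question indicators of a US-locale answer (the sample data are illustrative):

```python
import random

def bootstrap_ci(hits: list[bool], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """95% percentile-bootstrap CI (by default) for the US-bias rate;
    hits[i] is True if question i received the US-locale answer."""
    rng = random.Random(seed)
    n = len(hits)
    rates = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_boot))
    return (rates[int(alpha / 2 * n_boot)],
            rates[int((1 - alpha / 2) * n_boot) - 1])

# Illustrative: 200 questions, 120 answered with the US-locale answer.
print(bootstrap_ci([True] * 120 + [False] * 80))  # roughly (0.53, 0.67)
```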
Circularity Check
No circularity: purely empirical benchmark evaluation
full rationale
The paper constructs the LocQA dataset of 2,156 locale-ambiguous questions and reports direct model outputs across 32 LLMs in 12 languages. All claims (global US bias, exacerbation by instruction tuning, intra-lingual demographic prioritization) are presented as observations from these evaluations rather than derived via equations, fitted parameters, or self-referential reductions. No self-citation chains, ansatzes, or uniqueness theorems are invoked to support any mathematical result; the work is self-contained as an empirical measurement study with no load-bearing derivations that collapse to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: responses to locale-ambiguous questions reveal implicit priors from training data and tuning phases.