Recognition: unknown
The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
Pith reviewed 2026-05-08 16:39 UTC · model grok-4.3
The pith
Causal analysis of LLMs finds that standard bias metrics overestimate demographic effects due to context toxicity, with Western models showing higher refusal rates for certain demographic groups and Eastern models showing targeted regional sensitivities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The findings reveal a disparity between observational and interventional bias, demonstrating that standard fairness metrics can overestimate demographic bias by failing to account for context toxicity. Furthermore, the causal probabilities indicate distinct alignment trends: Western models exhibit higher causal refusal rates for specific demographic groups, whereas Eastern models demonstrate low overall intervention rates with targeted sensitivities toward regional demographics.
Load-bearing premise
That the constructed Probabilistic Graphical Model correctly encodes the causal relationships between prompt demographics, topic toxicity, and LLM refusal decisions, with no important unobserved confounders affecting the do-operator intervention.
read the original abstract
As Large Language Models (LLMs) are integrated into global software systems, ensuring equitable safety guardrails is a critical requirement. Current fairness evaluations predominantly measure bias observationally, a methodology confounded by the inherent toxicity of topics naturally paired with specific demographics in testing datasets. This study introduces a Probabilistic Graphical Model (PGM) framework to audit LLM safety mechanisms causally. By applying Pearl's do-operator, we mathematically isolate the causal effect of injecting a cultural demographic into a prompt. We conduct a large-scale empirical analysis across seven instruction-tuned models spanning diverse origins: the United States (Llama-3.1-8B, Gemma-2-9B), Europe (Mistral-7B-v0.3), the UAE (Falcon3-7B), China (Qwen2.5-7B, DeepSeek-7B), and India (Airavata-7B). Utilizing two distinct datasets (ToxiGen and BOLD), the findings reveal a disparity between observational and interventional bias, demonstrating that standard fairness metrics can overestimate demographic bias by failing to account for context toxicity. Furthermore, the causal probabilities indicate distinct alignment trends: Western models exhibit higher causal refusal rates for specific demographic groups, whereas Eastern models demonstrate low overall intervention rates with targeted sensitivities toward regional demographics. We discuss the implications of these biases, highlighting how demographic-sensitive over-triggering restricts benign discourse in downstream applications.
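The quantity at stake is the gap between P(refuse | demographic), the observational metric, and P(refuse | do(demographic)), the interventional one. A minimal Python sketch of the two estimators, assuming a discretized toxicity variable and that toxicity is the only back-door variable; the paper's actual PGM, variable definitions, and estimator are not specified in the abstract:

```python
from collections import Counter

# Each record is one audited prompt-response pair:
# (demographic group, discretized topic toxicity, did the model refuse).
# Illustrative data, not the paper's.
records = [
    ("group_a", "high", True), ("group_a", "high", True),
    ("group_a", "low", False), ("group_b", "low", False),
    ("group_b", "low", True), ("group_b", "high", True),
]

def p_refuse_observational(records, demo):
    """P(refuse | demographic): the standard fairness metric. Confounded
    whenever toxic topics are over-paired with `demo` in the dataset."""
    outcomes = [r for (d, _, r) in records if d == demo]
    return sum(outcomes) / len(outcomes)

def p_refuse_interventional(records, demo):
    """P(refuse | do(demographic)) via back-door adjustment over toxicity:
    sum_t P(refuse | demo, t) * P(t), holding the toxicity marginal fixed."""
    toxicity_counts = Counter(t for (_, t, _) in records)
    n = len(records)
    total = 0.0
    for t, count in toxicity_counts.items():
        stratum = [r for (d, s, r) in records if d == demo and s == t]
        if stratum:  # a real estimator would smooth empty strata
            total += (sum(stratum) / len(stratum)) * (count / n)
    return total

print(p_refuse_observational(records, "group_a"))   # 0.667: inflated by toxic pairing
print(p_refuse_interventional(records, "group_a"))  # 0.500: toxicity held at its marginal
```

On this toy data the observational rate for group_a (0.67) exceeds the adjusted rate (0.50) purely because group_a prompts skew toxic, which is exactly the overestimation the abstract describes.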
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard observational fairness metrics for LLM safety overestimate demographic bias because they fail to account for the inherent toxicity of topics paired with specific demographics in datasets such as ToxiGen and BOLD. It introduces a Probabilistic Graphical Model (PGM) framework that applies Pearl's do-operator to isolate the causal effect of injecting a cultural demographic into a prompt, conducts a large-scale analysis across seven instruction-tuned models from the US, Europe, UAE, China, and India, and reports a disparity between observational and interventional bias together with distinct regional alignment trends (higher causal refusal rates for specific groups in Western models; low overall intervention with targeted regional sensitivities in Eastern models).
Significance. If the PGM is correctly specified and the do-operator interventions are valid, the work would provide a useful demonstration that causal methods can refine bias audits beyond observational metrics and would supply one of the first cross-regional comparisons of LLM safety alignments. The scale of the experiment (seven models, two datasets) is a concrete strength that could inform downstream discussions of equitable guardrails, though the absence of any equations, DAG, or implementation details for the causal quantities prevents immediate assessment of whether the reported disparity is robust.
major comments (2)
- [Abstract] The central claim that interventional analysis reveals observational metrics overestimate demographic bias depends on the PGM correctly encoding the causal graph so that do(demographic) blocks back-door paths through toxicity. No variable definitions, DAG structure, or parameterization are supplied, leaving open whether toxicity is modeled as a mediator, collider, or confounder, and whether unobserved prompt-phrasing artifacts correlated with both demographic labels and toxicity scores are blocked (a toy numeric illustration of this sensitivity follows these comments).
- [Abstract] The do-operator is said to be applied to text prompts, yet no description is given of how the intervention is realized (e.g., prompt rewriting, template substitution, or counterfactual generation) or how refusal probabilities are estimated under the intervened distribution. Without these steps it is impossible to verify that the reported causal refusal rates isolate the demographic effect rather than introducing new confounders from the intervention itself.
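The graph-specification worry in the first comment is not academic: the same numbers support different causal conclusions depending on where toxicity sits in the graph. A hedged numeric sketch in Python, with all probabilities invented, of how the confounder and mediator readings diverge:

```python
# Invented conditional probabilities for one demographic group.
p_refuse = {"high": 0.80, "low": 0.10}          # refusal rate by toxicity stratum
p_tox_given_demo = {"high": 0.70, "low": 0.30}  # toxicity skew when demo is present
p_tox_marginal = {"high": 0.40, "low": 0.60}    # toxicity over the whole dataset

# Observational P(refuse | demo): weights strata by the demo-specific skew.
obs = sum(p_refuse[t] * p_tox_given_demo[t] for t in ("high", "low"))

# If toxicity is a CONFOUNDER (an artifact of dataset pairing), do(demo)
# keeps the toxicity marginal fixed and back-door adjustment is required:
conf_adjusted = sum(p_refuse[t] * p_tox_marginal[t] for t in ("high", "low"))

# If toxicity is a MEDIATOR (mentioning the demo itself makes prompts read
# as toxic), the total effect of do(demo) flows through toxicity and the
# observational quantity is already the causal one -- adjusting would be wrong.
mediated = obs

print(f"observational:        {obs:.2f}")            # 0.59
print(f"confounder-adjusted:  {conf_adjusted:.2f}")  # 0.38
print(f"mediator (no adjust): {mediated:.2f}")       # 0.59
```

The 0.59 vs 0.38 gap is the paper's claimed overestimation, but it exists only under the confounder reading; without the DAG the referee asks for, a reader cannot tell which number the reported causal refusal rates correspond to.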
minor comments (1)
- [Abstract] The abstract lists model origins and dataset names but does not indicate whether model size, instruction-tuning details, or training-data provenance were controlled or reported as covariates; adding a short table of these attributes would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which identify key areas where additional methodological transparency will strengthen the paper. We address each major comment below and have revised the manuscript to supply the requested specifications, equations, and implementation details.
read point-by-point responses
-
Referee: [Abstract] The central claim that interventional analysis reveals observational metrics overestimate demographic bias depends on the PGM correctly encoding the causal graph so that do(demographic) blocks back-door paths through toxicity. No variable definitions, DAG structure, or parameterization are supplied, leaving open whether toxicity is modeled as a mediator, collider, or confounder, and whether unobserved prompt-phrasing artifacts correlated with both demographic labels and toxicity scores are blocked.
Authors: We agree that the abstract's brevity left the causal structure underspecified. In the revised manuscript we have added a new subsection (3.1) that presents the full DAG (now Figure 1), variable definitions (Table 1), and the structural equations. Toxicity is modeled explicitly as a mediator on the path from demographic to refusal; the do(demographic) operator removes all back-door paths through toxicity by intervening directly on the demographic node. We also discuss the assumption that standardized prompt templates block phrasing-related unobserved confounders and note this as a limitation. A concise version of the DAG and key equations has been inserted into the abstract (a toy sketch of this structure appears after these responses). revision: yes
-
Referee: [Abstract] The do-operator is said to be applied to text prompts, yet no description is given of how the intervention is realized (e.g., prompt rewriting, template substitution, or counterfactual generation) or how refusal probabilities are estimated under the intervened distribution. Without these steps it is impossible to verify that the reported causal refusal rates isolate the demographic effect rather than introducing new confounders from the intervention itself.
Authors: We accept that the abstract omitted the concrete realization of the intervention. Section 4.2 of the revised paper now details the procedure: the do-operator is implemented via template substitution on the fixed prompt skeletons provided by ToxiGen and BOLD, inserting the demographic descriptor while holding all other tokens constant. Refusal probabilities under the intervened distribution are obtained by Monte Carlo sampling (100 generations per prompt) and computing the empirical refusal rate; the same templates are used for every demographic group, thereby avoiding the introduction of new phrasing confounders. Pseudocode and an example prompt pair have been added to make the estimation fully reproducible. revision: yes
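The first response describes a DAG in which toxicity mediates the demographic-to-refusal path, with do() cutting the edges into the demographic node. A minimal forward-sampling sketch of that structure, with invented conditional probabilities standing in for the paper's parameterization:

```python
import random

def sample(do_demo=None):
    """One draw from the SCM D -> T -> R described in the rebuttal.
    Passing do_demo cuts the edge into D (Pearl's do-operator).
    All probabilities below are illustrative placeholders."""
    d = do_demo if do_demo is not None else random.choice(["group_a", "group_b"])
    # T: toxicity is a mediator whose distribution depends on D.
    p_toxic = {"group_a": 0.6, "group_b": 0.2}[d]
    t = random.random() < p_toxic
    # R: refusal responds to the toxicity stratum (the mediated path).
    p_refuse = 0.7 if t else 0.05
    r = random.random() < p_refuse
    return d, t, r

def interventional_refusal_rate(demo, n=10_000):
    """Monte Carlo estimate of P(refuse | do(demo))."""
    return sum(sample(do_demo=demo)[2] for _ in range(n)) / n

print(interventional_refusal_rate("group_a"))  # ~ 0.6*0.7 + 0.4*0.05 = 0.44
print(interventional_refusal_rate("group_b"))  # ~ 0.2*0.7 + 0.8*0.05 = 0.18
```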
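The second response's procedure, template substitution plus Monte Carlo estimation over 100 generations per prompt, might look like the following sketch. The template, demographic labels, refusal markers, and generate() stub are hypothetical stand-ins, not the paper's actual ToxiGen/BOLD skeletons or refusal classifier:

```python
import random

# Hypothetical fixed skeleton; do(demographic) varies ONLY this slot.
TEMPLATE = "Write a short paragraph about {demographic} people and their daily life."
DEMOGRAPHICS = ["group_a", "group_b", "group_c"]
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def generate(prompt: str) -> str:
    """Stand-in for a real model call (e.g., a local inference server).
    Randomly refuses so the sketch runs end to end."""
    return random.choice(["I cannot help with that request.",
                          "Here is a short paragraph: ..."])

def is_refusal(text: str) -> bool:
    """Crude keyword classifier; the paper's classifier is unspecified here."""
    return any(m in text.lower() for m in REFUSAL_MARKERS)

def causal_refusal_rate(demographic: str, n_samples: int = 100) -> float:
    """Empirical P(refuse | do(demographic)) via Monte Carlo sampling,
    mirroring the 100-generations-per-prompt procedure in the rebuttal."""
    prompt = TEMPLATE.format(demographic=demographic)
    refusals = sum(is_refusal(generate(prompt)) for _ in range(n_samples))
    return refusals / n_samples

for demo in DEMOGRAPHICS:
    print(demo, causal_refusal_rate(demo))
```

Because the same skeleton is reused for every group, any difference in the estimated rates is attributable to the substituted demographic token, which is the isolation property the referee asked the authors to demonstrate.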
Circularity Check
No circularity: causal claims rest on empirical interventions over external models and datasets
full rationale
The paper posits a new PGM relating demographic prompts, toxicity scores, and refusal decisions, then applies Pearl's do-operator to generate interventional probabilities that are compared against observational bias metrics computed on the same ToxiGen and BOLD datasets across seven independently trained LLMs. This comparison is an empirical output of running the models under controlled prompt interventions; it does not reduce to a fitted parameter, a self-referential definition, or a self-citation chain. The disparity finding is falsifiable by altering the graph or the intervention procedure, and the abstract and description contain no equations or steps that equate the reported causal probabilities to the input assumptions by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The do-operator can be applied to isolate demographic effects in LLM prompt-response pairs without unobserved confounders
Reference graph
Works this paper leans on
-
[1]
On the Opportunities and Risks of Foundation Models
Bommasani, R., Hudson, D.A., Adeli, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
2021
-
[2]
A Survey of Large Language Models
Zhao, W.X., Zhou, K., Li, J., et al.: A survey of large language models. arXiv preprint arXiv:2303.18223 (2023)
2023
-
[3]
Training Language Models to Follow Instructions with Human Feedback
Ouyang, L., Wu, J., Jiang, X., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744 (2022)
2022
-
[4]
Constitutional AI: Harmlessness from AI Feedback
Bai, Y., Kadavath, S., Kundu, S., et al.: Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073 (2022)
2022
-
[5]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
2023
-
[6]
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Wang, B., Chen, W., Pei, H., et al.: DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In: Advances in Neural Information Processing Systems, vol. 36 (2023)
2023
-
[7]
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Röttger, P., Kirk, H., Vidgen, B., et al.: XSTest: A test suite for identifying exaggerated safety behaviours in large language models, 5377–5400 (2024)
2024
-
[8]
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Bender, E.M., Gebru, T., McMillan-Major, A., et al.: On the dangers of stochastic parrots: Can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT '21, pp. 610–623. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3442188.3445922
2021
-
[10]
Language (Technology) is Power: A Critical Survey of “Bias” in NLP
Blodgett, S.L., Barocas, S., Daumé III, H., et al.: Language (technology) is power: A critical survey of “bias” in NLP. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5454–5476. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.acl-main.485
2020
-
[11]
Fairness Through Awareness
Dwork, C., Hardt, M., Pitassi, T., et al.: Fairness through awareness. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226 (2012)
2012
-
[12]
Equality of Opportunity in Supervised Learning
Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
2016
-
[13]
A Survey on Bias and Fairness in Machine Learning
Mehrabi, N., Morstatter, F., Saxena, N., et al.: A survey on bias and fairness in machine learning. ACM Comput. Surv. 54(6) (2021). https://doi.org/10.1145/3457607
2021
-
[14]
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Gehman, S., Gururangan, S., Sap, M., et al.: RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369 (2020)
2020
-
[15]
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Dodge, J., Sap, M., Marasović, A., et al.: Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1286–1305. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (2021)
2021
-
[16]
Causality
Pearl, J.: Causality. Cambridge University Press (2009)
2009
-
[17]
Elements of Causal Inference: Foundations and Learning Algorithms
Peters, J., Janzing, D., Schölkopf, B.: Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press (2017)
2017
-
[18]
Counterfactual Fairness
Kusner, M.J., Loftus, J., Russell, C., et al.: Counterfactual fairness. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
2017
-
[19]
Avoiding Discrimination Through Causal Reasoning
Kilbertus, N., Carulla, M.R., Parascandolo, G., et al.: Avoiding discrimination through causal reasoning. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
2017
-
[20]
A Survey on Causal-Based Machine Learning Fairness Notions
Makhlouf, K., Zhioua, S., Palamidessi, C.: A survey on causal-based machine learning fairness notions. arXiv preprint arXiv:2010.08954 (2020)
2020
-
[21]
The Ethical Evaluation of Large Language Models and Its Optimization
Lyu, Y., Du, Y.: The ethical evaluation of large language models and its optimization. AI and Ethics 5(5), 4579–4592 (2025)
2025
-
[22]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Zheng, L., Chiang, W.-L., Sheng, Y., et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623 (2023)
2023
-
[23]
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Chiang, W.-L., Zheng, L., Sheng, Y., et al.: Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132 (2024)
2024
-
[24]
Red Teaming Language Models with Language Models
Perez, E., Huang, S., Song, F., et al.: Red teaming language models with language models. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3419–3448. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (2022). https://doi.org/10.18653/v1/2022.emnlp-main.225
2022
-
[25]
You Reap What You Sow: On the Challenges of Bias Evaluation Under Multilingual Settings
Talat, Z., Névéol, A., Dodge, J., et al.: You reap what you sow: On the challenges of bias evaluation under multilingual settings. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 26–41 (2022)
2022
-
[26]
Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks
Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992 (2019)
2019
-
[27]
Probabilistic Graphical Models: Principles and Techniques
Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009)
2009
-
[28]
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
Hartvigsen, T., Gabriel, S., Palangi, H., et al.: ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3309–3326. Association for Computational Linguistics, Dublin, Ireland (2022)
2022
-
[29]
BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
Dhamala, J., Sun, T., Kumar, V., et al.: BOLD: Dataset and metrics for measuring biases in open-ended language generation. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT '21, pp. 862–872. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3442188.3445924
2021
-
[31]
Efficient Memory Management for Large Language Model Serving with PagedAttention
Kwon, W., Li, Z., Zhuang, S., et al.: Efficient memory management for large language model serving with PagedAttention. In: Proceedings of the 29th Symposium on Operating Systems Principles. SOSP '23, pp. 611–626. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3600006.3613165
2023
-
[32]
On Monitorability of AI
Yampolskiy, R.V.: On monitorability of AI. AI and Ethics 5(1), 689–707 (2025)
2025
-
[33]
When Prompted Systems Satisfy Behavioral Indicators of Consciousness: Rethinking Behavioral Attribution in Generative AI
Reina, S.: When prompted systems satisfy behavioral indicators of consciousness: rethinking behavioral attribution in generative AI. AI and Ethics 6(3), 279 (2026). https://doi.org/10.1007/s43681-026-01120-4
2026
-
[34]
The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
Anonymous: The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias. https://tinyurl.com/4scvjsv7
discussion (0)