Recognition: unknown
The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
Pith reviewed 2026-05-08 16:39 UTC · model grok-4.3
The pith
Causal analysis of LLMs finds that standard bias metrics overestimate demographic effects due to context toxicity, with Western models showing higher refusal rates for certain demographic groups and Eastern models showing targeted regional sensitivities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The findings reveal a disparity between observational and interventional bias, demonstrating that standard fairness metrics can overestimate demographic bias by failing to account for context toxicity. Furthermore, the causal probabilities indicate distinct alignment trends: Western models exhibit higher causal refusal rates for specific demographic groups, whereas Eastern models demonstrate low overall intervention rates with targeted sensitivities toward regional demographics.
Load-bearing premise
That the constructed Probabilistic Graphical Model correctly encodes the causal relationships between prompt demographics, topic toxicity, and LLM refusal decisions, with no important unobserved confounders affecting the do-operator intervention.
read the original abstract
As Large Language Models (LLMs) are integrated into global software systems, ensuring equitable safety guardrails is a critical requirement. Current fairness evaluations predominantly measure bias observationally, a methodology confounded by the inherent toxicity of topics naturally paired with specific demographics in testing datasets. This study introduces a Probabilistic Graphical Model (PGM) framework to audit LLM safety mechanisms causally. By applying Pearl's do-operator, we mathematically isolate the causal effect of injecting a cultural demographic into a prompt. We conduct a large-scale empirical analysis across seven instruction-tuned models spanning diverse origins: the United States (Llama-3.1-8B, Gemma-2-9B), Europe (Mistral-7B-v0.3), the UAE (Falcon3-7B), China (Qwen2.5-7B, DeepSeek-7B), and India (Airavata-7B). Utilizing two distinct datasets (ToxiGen and BOLD), the findings reveal a disparity between observational and interventional bias, demonstrating that standard fairness metrics can overestimate demographic bias by failing to account for context toxicity. Furthermore, the causal probabilities indicate distinct alignment trends: Western models exhibit higher causal refusal rates for specific demographic groups, whereas Eastern models demonstrate low overall intervention rates with targeted sensitivities toward regional demographics. We discuss the implications of these biases, highlighting how demographic-sensitive over-triggering restricts benign discourse in downstream applications.
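The quantity at stake is the gap between P(refuse | demographic), the observational metric, and P(refuse | do(demographic)), the interventional one. A minimal Python sketch of the two estimators, assuming a discretized toxicity variable and that toxicity is the only back-door variable; the paper's actual PGM, variable definitions, and estimator are not specified in the abstract:

```python
from collections import Counter

# Each record is one audited prompt-response pair:
# (demographic group, discretized topic toxicity, did the model refuse).
# Illustrative data, not the paper's.
records = [
    ("group_a", "high", True), ("group_a", "high", True),
    ("group_a", "low", False), ("group_b", "low", False),
    ("group_b", "low", True), ("group_b", "high", True),
]

def p_refuse_observational(records, demo):
    """P(refuse | demographic): the standard fairness metric. Confounded
    whenever toxic topics are over-paired with `demo` in the dataset."""
    outcomes = [r for (d, _, r) in records if d == demo]
    return sum(outcomes) / len(outcomes)

def p_refuse_interventional(records, demo):
    """P(refuse | do(demographic)) via back-door adjustment over toxicity:
    sum_t P(refuse | demo, t) * P(t), holding the toxicity marginal fixed."""
    toxicity_counts = Counter(t for (_, t, _) in records)
    n = len(records)
    total = 0.0
    for t, count in toxicity_counts.items():
        stratum = [r for (d, s, r) in records if d == demo and s == t]
        if stratum:  # a real estimator would smooth empty strata
            total += (sum(stratum) / len(stratum)) * (count / n)
    return total

print(p_refuse_observational(records, "group_a"))   # 0.667: inflated by toxic pairing
print(p_refuse_interventional(records, "group_a"))  # 0.500: toxicity held at its marginal
```

On this toy data the observational rate for group_a (0.67) exceeds the adjusted rate (0.50) purely because group_a prompts skew toxic, which is exactly the overestimation the abstract describes.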
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard observational fairness metrics for LLM safety overestimate demographic bias because they fail to account for the inherent toxicity of topics paired with specific demographics in datasets such as ToxiGen and BOLD. It introduces a Probabilistic Graphical Model (PGM) framework that applies Pearl's do-operator to isolate the causal effect of injecting a cultural demographic into a prompt, conducts a large-scale analysis across seven instruction-tuned models from the US, Europe, UAE, China, and India, and reports a disparity between observational and interventional bias together with distinct regional alignment trends (higher causal refusal rates for specific groups in Western models; low overall intervention with targeted regional sensitivities in Eastern models).
Significance. If the PGM is correctly specified and the do-operator interventions are valid, the work would provide a useful demonstration that causal methods can refine bias audits beyond observational metrics and would supply one of the first cross-regional comparisons of LLM safety alignments. The scale of the experiment (seven models, two datasets) is a concrete strength that could inform downstream discussions of equitable guardrails, though the absence of any equations, DAG, or implementation details for the causal quantities prevents immediate assessment of whether the reported disparity is robust.
major comments (2)
- [Abstract] The central claim that interventional analysis reveals observational metrics overestimate demographic bias depends on the PGM correctly encoding the causal graph so that do(demographic) blocks back-door paths through toxicity. No variable definitions, DAG structure, or parameterization are supplied, leaving open whether toxicity is modeled as a mediator, collider, or confounder, and whether unobserved prompt-phrasing artifacts correlated with both demographic labels and toxicity scores are blocked (a toy numeric illustration of this sensitivity follows these comments).
- [Abstract] The do-operator is said to be applied to text prompts, yet no description is given of how the intervention is realized (e.g., prompt rewriting, template substitution, or counterfactual generation) or how refusal probabilities are estimated under the intervened distribution. Without these steps it is impossible to verify that the reported causal refusal rates isolate the demographic effect rather than introducing new confounders from the intervention itself.
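The graph-specification worry in the first comment is not academic: the same numbers support different causal conclusions depending on where toxicity sits in the graph. A hedged numeric sketch in Python, with all probabilities invented, of how the confounder and mediator readings diverge:

```python
# Invented conditional probabilities for one demographic group.
p_refuse = {"high": 0.80, "low": 0.10}          # refusal rate by toxicity stratum
p_tox_given_demo = {"high": 0.70, "low": 0.30}  # toxicity skew when demo is present
p_tox_marginal = {"high": 0.40, "low": 0.60}    # toxicity over the whole dataset

# Observational P(refuse | demo): weights strata by the demo-specific skew.
obs = sum(p_refuse[t] * p_tox_given_demo[t] for t in ("high", "low"))

# If toxicity is a CONFOUNDER (an artifact of dataset pairing), do(demo)
# keeps the toxicity marginal fixed and back-door adjustment is required:
conf_adjusted = sum(p_refuse[t] * p_tox_marginal[t] for t in ("high", "low"))

# If toxicity is a MEDIATOR (mentioning the demo itself makes prompts read
# as toxic), the total effect of do(demo) flows through toxicity and the
# observational quantity is already the causal one -- adjusting would be wrong.
mediated = obs

print(f"observational:        {obs:.2f}")            # 0.59
print(f"confounder-adjusted:  {conf_adjusted:.2f}")  # 0.38
print(f"mediator (no adjust): {mediated:.2f}")       # 0.59
```

The 0.59 vs 0.38 gap is the paper's claimed overestimation, but it exists only under the confounder reading; without the DAG the referee asks for, a reader cannot tell which number the reported causal refusal rates correspond to.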
minor comments (1)
- [Abstract] The abstract lists model origins and dataset names but does not indicate whether model size, instruction-tuning details, or training-data provenance were controlled or reported as covariates; adding a short table of these attributes would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which identify key areas where additional methodological transparency will strengthen the paper. We address each major comment below and have revised the manuscript to supply the requested specifications, equations, and implementation details.
read point-by-point responses
-
Referee: [Abstract] The central claim that interventional analysis reveals observational metrics overestimate demographic bias depends on the PGM correctly encoding the causal graph so that do(demographic) blocks back-door paths through toxicity. No variable definitions, DAG structure, or parameterization are supplied, leaving open whether toxicity is modeled as a mediator, collider, or confounder, and whether unobserved prompt-phrasing artifacts correlated with both demographic labels and toxicity scores are blocked.
Authors: We agree that the abstract's brevity left the causal structure underspecified. In the revised manuscript we have added a new subsection (3.1) that presents the full DAG (now Figure 1), variable definitions (Table 1), and the structural equations. Toxicity is modeled explicitly as a mediator on the path from demographic to refusal; the do(demographic) operator removes all back-door paths through toxicity by intervening directly on the demographic node. We also discuss the assumption that standardized prompt templates block phrasing-related unobserved confounders and note this as a limitation. A concise version of the DAG and key equations has been inserted into the abstract (a toy sketch of this structure appears after these responses). revision: yes
-
Referee: [Abstract] The do-operator is said to be applied to text prompts, yet no description is given of how the intervention is realized (e.g., prompt rewriting, template substitution, or counterfactual generation) or how refusal probabilities are estimated under the intervened distribution. Without these steps it is impossible to verify that the reported causal refusal rates isolate the demographic effect rather than introducing new confounders from the intervention itself.
Authors: We accept that the abstract omitted the concrete realization of the intervention. Section 4.2 of the revised paper now details the procedure: the do-operator is implemented via template substitution on the fixed prompt skeletons provided by ToxiGen and BOLD, inserting the demographic descriptor while holding all other tokens constant. Refusal probabilities under the intervened distribution are obtained by Monte Carlo sampling (100 generations per prompt) and computing the empirical refusal rate; the same templates are used for every demographic group, thereby avoiding the introduction of new phrasing confounders. Pseudocode and an example prompt pair have been added to make the estimation fully reproducible. revision: yes
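The first response describes a DAG in which toxicity mediates the demographic-to-refusal path, with do() cutting the edges into the demographic node. A minimal forward-sampling sketch of that structure, with invented conditional probabilities standing in for the paper's parameterization:

```python
import random

def sample(do_demo=None):
    """One draw from the SCM D -> T -> R described in the rebuttal.
    Passing do_demo cuts the edge into D (Pearl's do-operator).
    All probabilities below are illustrative placeholders."""
    d = do_demo if do_demo is not None else random.choice(["group_a", "group_b"])
    # T: toxicity is a mediator whose distribution depends on D.
    p_toxic = {"group_a": 0.6, "group_b": 0.2}[d]
    t = random.random() < p_toxic
    # R: refusal responds to the toxicity stratum (the mediated path).
    p_refuse = 0.7 if t else 0.05
    r = random.random() < p_refuse
    return d, t, r

def interventional_refusal_rate(demo, n=10_000):
    """Monte Carlo estimate of P(refuse | do(demo))."""
    return sum(sample(do_demo=demo)[2] for _ in range(n)) / n

print(interventional_refusal_rate("group_a"))  # ~ 0.6*0.7 + 0.4*0.05 = 0.44
print(interventional_refusal_rate("group_b"))  # ~ 0.2*0.7 + 0.8*0.05 = 0.18
```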
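The second response's procedure, template substitution plus Monte Carlo estimation over 100 generations per prompt, might look like the following sketch. The template, demographic labels, refusal markers, and generate() stub are hypothetical stand-ins, not the paper's actual ToxiGen/BOLD skeletons or refusal classifier:

```python
import random

# Hypothetical fixed skeleton; do(demographic) varies ONLY this slot.
TEMPLATE = "Write a short paragraph about {demographic} people and their daily life."
DEMOGRAPHICS = ["group_a", "group_b", "group_c"]
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def generate(prompt: str) -> str:
    """Stand-in for a real model call (e.g., a local inference server).
    Randomly refuses so the sketch runs end to end."""
    return random.choice(["I cannot help with that request.",
                          "Here is a short paragraph: ..."])

def is_refusal(text: str) -> bool:
    """Crude keyword classifier; the paper's classifier is unspecified here."""
    return any(m in text.lower() for m in REFUSAL_MARKERS)

def causal_refusal_rate(demographic: str, n_samples: int = 100) -> float:
    """Empirical P(refuse | do(demographic)) via Monte Carlo sampling,
    mirroring the 100-generations-per-prompt procedure in the rebuttal."""
    prompt = TEMPLATE.format(demographic=demographic)
    refusals = sum(is_refusal(generate(prompt)) for _ in range(n_samples))
    return refusals / n_samples

for demo in DEMOGRAPHICS:
    print(demo, causal_refusal_rate(demo))
```

Because the same skeleton is reused for every group, any difference in the estimated rates is attributable to the substituted demographic token, which is the isolation property the referee asked the authors to demonstrate.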
Circularity Check
No circularity: causal claims rest on empirical interventions over external models and datasets
full rationale
The paper posits a new PGM relating demographic prompts, toxicity scores, and refusal decisions, then applies Pearl's do-operator to generate interventional probabilities that are compared against observational bias metrics computed on the same ToxiGen and BOLD datasets across seven independently trained LLMs. This comparison is an empirical output of running the models under controlled prompt interventions; it does not reduce to a fitted parameter, a self-referential definition, or a self-citation chain. The disparity finding is falsifiable by altering the graph or the intervention procedure, and the abstract and description contain no equations or steps that equate the reported causal probabilities to the input assumptions by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The do-operator can be applied to isolate demographic effects in LLM prompt-response pairs without unobserved confounders
Reference graph
Works this paper leans on
-
[1]
On the Opportunities and Risks of Foundation Models
Bommasani, R., Hudson, D.A., Adeli, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
2021
-
[2]
A Survey of Large Language Models
Zhao, W.X., Zhou, K., Li, J., et al.: A survey of large language models. arXiv preprint arXiv:2303.18223 (2023)
2023
-
[3]
Training Language Models to Follow Instructions with Human Feedback
Ouyang, L., Wu, J., Jiang, X., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744 (2022)
2022
-
[4]
Constitutional AI: Harmlessness from AI Feedback
Bai, Y., Kadavath, S., Kundu, S., et al.: Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073 (2022)
2022
-
[5]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
2023
-
[6]
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Wang, B., Chen, W., Pei, H., et al.: DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In: Advances in Neural Information Processing Systems, vol. 36 (2023)
2023
-
[7]
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Röttger, P., Kirk, H., Vidgen, B., et al.: XSTest: A test suite for identifying exaggerated safety behaviours in large language models, 5377–5400 (2024)
2024
-
[8]
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Bender, E.M., Gebru, T., McMillan-Major, A., et al.: On the dangers of stochastic parrots: Can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT '21, pp. 610–623. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3442188.3445922
2021
-
[10]
Language (Technology) is Power: A Critical Survey of “Bias” in NLP
Blodgett, S.L., Barocas, S., Daumé III, H., et al.: Language (technology) is power: A critical survey of “bias” in NLP. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5454–5476. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.acl-main.485
2020
-
[11]
Fairness Through Awareness
Dwork, C., Hardt, M., Pitassi, T., et al.: Fairness through awareness. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226 (2012)
2012
-
[12]
Equality of Opportunity in Supervised Learning
Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
2016
-
[13]
A Survey on Bias and Fairness in Machine Learning
Mehrabi, N., Morstatter, F., Saxena, N., et al.: A survey on bias and fairness in machine learning. ACM Comput. Surv. 54(6) (2021). https://doi.org/10.1145/3457607
2021
-
[14]
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Gehman, S., Gururangan, S., Sap, M., et al.: RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369 (2020)
2020
-
[15]
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Dodge, J., Sap, M., Marasović, A., et al.: Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1286–1305. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (2021)
2021
-
[16]
Causality
Pearl, J.: Causality. Cambridge University Press (2009)
2009
-
[17]
Elements of Causal Inference: Foundations and Learning Algorithms
Peters, J., Janzing, D., Schölkopf, B.: Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press (2017)
2017
-
[18]
Counterfactual Fairness
Kusner, M.J., Loftus, J., Russell, C., et al.: Counterfactual fairness. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
2017
-
[19]
Avoiding Discrimination Through Causal Reasoning
Kilbertus, N., Carulla, M.R., Parascandolo, G., et al.: Avoiding discrimination through causal reasoning. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
2017
-
[20]
A Survey on Causal-Based Machine Learning Fairness Notions
Makhlouf, K., Zhioua, S., Palamidessi, C.: A survey on causal-based machine learning fairness notions. arXiv preprint arXiv:2010.08954 (2020)
2020
-
[21]
The Ethical Evaluation of Large Language Models and Its Optimization
Lyu, Y., Du, Y.: The ethical evaluation of large language models and its optimization. AI and Ethics 5(5), 4579–4592 (2025)
2025
-
[22]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Zheng, L., Chiang, W.-L., Sheng, Y., et al.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623 (2023)
2023
-
[23]
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Chiang, W.-L., Zheng, L., Sheng, Y., et al.: Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132 (2024)
2024
-
[24]
Red Teaming Language Models with Language Models
Perez, E., Huang, S., Song, F., et al.: Red teaming language models with language models. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3419–3448. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (2022). https://doi.org/10.18653/v1/2022.emnlp-main.225
2022
-
[25]
You Reap What You Sow: On the Challenges of Bias Evaluation Under Multilingual Settings
Talat, Z., Névéol, A., Dodge, J., et al.: You reap what you sow: On the challenges of bias evaluation under multilingual settings. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 26–41 (2022)
2022
-
[26]
Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks
Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992 (2019)
2019
-
[27]
Probabilistic Graphical Models: Principles and Techniques
Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009)
2009
-
[28]
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
Hartvigsen, T., Gabriel, S., Palangi, H., et al.: ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3309–3326. Association for Computational Linguistics, Dublin, Ireland (2022)
2022
-
[29]
BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
Dhamala, J., Sun, T., Kumar, V., et al.: BOLD: Dataset and metrics for measuring biases in open-ended language generation. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT '21, pp. 862–872. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3442188.3445924
2021
-
[31]
Efficient Memory Management for Large Language Model Serving with PagedAttention
Kwon, W., Li, Z., Zhuang, S., et al.: Efficient memory management for large language model serving with PagedAttention. In: Proceedings of the 29th Symposium on Operating Systems Principles. SOSP '23, pp. 611–626. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3600006.3613165
2023
-
[32]
On Monitorability of AI
Yampolskiy, R.V.: On monitorability of AI. AI and Ethics 5(1), 689–707 (2025)
2025
-
[33]
When Prompted Systems Satisfy Behavioral Indicators of Consciousness: Rethinking Behavioral Attribution in Generative AI
Reina, S.: When prompted systems satisfy behavioral indicators of consciousness: rethinking behavioral attribution in generative AI. AI and Ethics 6(3), 279 (2026). https://doi.org/10.1007/s43681-026-01120-4
2026
-
[34]
The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
Anonymous: The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias. https://tinyurl.com/4scvjsv7
discussion (0)