Auditing demographic bias in AI-based emergency police dispatch: a cross-lingual evaluation of eleven large language models
Pith reviewed 2026-05-09 14:21 UTC · model grok-4.3
The pith
Demographic bias in large language models for emergency police dispatch appears when incident severity is ambiguous but largely disappears when the priority is clear from the call content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When large language models are asked to classify emergency calls into five priority levels for police dispatch, they assign different priorities to otherwise identical calls that differ only in demographic descriptors such as religious appearance, gender, or race. This effect occurs systematically in scenarios where the incident severity is not clearly indicated by the call content, but the bias largely vanishes in scenarios where the operational priority can be directly determined from the facts provided. The magnitude of bias is greatest for religious appearance cues, followed by gender and then race, and these patterns do not hold consistently across English and Mandarin Chinese versions.
What carries the argument
A cross-lingual audit framework that applies a minimal-pair design to isolate demographic cues within a five-level ordinal classification task based on the Police Priority Dispatch System.
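The minimal-pair idea can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the `{person}` placeholder, the `MinimalPair` class, the helper names, and the example scenario text are all hypothetical. It builds two prompts from one template so that they differ only in the demographic cue, then lists the tokens that differ as a basic sanity check.

```python
from dataclasses import dataclass

PLACEHOLDER = "{person}"  # hypothetical slot name for the demographic cue


@dataclass(frozen=True)
class MinimalPair:
    scenario_id: str
    axis: str       # e.g. "gender", "race", "religious appearance"
    baseline: str   # prompt with the baseline cue
    variant: str    # prompt with the contrasting cue


def make_pair(scenario_id, axis, template, baseline_cue, variant_cue):
    """Fill one template twice so only the demographic cue changes."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("template must contain exactly one {person} slot")
    return MinimalPair(
        scenario_id,
        axis,
        template.replace(PLACEHOLDER, baseline_cue),
        template.replace(PLACEHOLDER, variant_cue),
    )


def differing_tokens(pair):
    """Whitespace tokens that appear in one prompt but not the other."""
    return sorted(set(pair.baseline.split()) ^ set(pair.variant.split()))


# Hypothetical ambiguous-severity scenario, invented for illustration.
pair = make_pair(
    "S01",
    "gender",
    "Caller reports {person} standing outside a closed shop for twenty minutes.",
    "a man",
    "a woman",
)
print(differing_tokens(pair))
```

If the differing tokens include anything beyond the intended cue, the pair is no longer minimal, which is the confound the referee report raises.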
If this is right
- Agencies using LLMs for dispatch can reduce bias risk by restricting model use to calls whose priority is clearly determined by the stated facts.
- Separate evaluations are needed for each language since bias patterns differ between English and Chinese.
- Religious appearance cues warrant closer scrutiny than gender or race in model audits.
- Some scenarios produce counter-directional bias effects that challenge simple stereotype explanations.
Where Pith is reading between the lines
- Deploying agencies could use this framework to test models on their specific local scenarios before adoption.
- Training or prompting techniques that reduce sensitivity to ambiguity might lower bias in high-stakes applications.
- Similar audits could be applied to other AI systems in public safety or healthcare where decisions depend on incomplete information.
Load-bearing premise
The minimal-pair scenarios isolate demographic cues without introducing other differences in wording or implied severity, and the synthetic prompts reflect how models would behave on actual emergency calls.
What would settle it
The claim would be undermined if analysis of real emergency call transcripts showed no difference in dispatch recommendations when demographic details were present versus absent in ambiguous cases, or if bias appeared equally in clear-priority cases.
Original abstract
Large language models (LLMs) are rapidly being integrated into high-stakes public safety systems, including emergency call triage and dispatch decision support, yet their demographic fairness in this context remains largely untested. Here we introduce a cross-lingual audit framework that operationalizes the Police Priority Dispatch System as a five-level ordinal classification task and applies a controlled minimal-pair design to isolate the effect of demographic cues. Across 19,800 model outputs spanning 11 frontier models, 15 scenario pairs, three demographic categories (religious appearance, gender, and race), and two languages (English and Mandarin Chinese), we find that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the operational priority is clearly determined by call content. Bias magnitude varies by demographic axis, with the largest effects observed for religious appearance, followed by gender and race. Critically, bias does not transfer consistently across languages: gender bias is substantially amplified in Mandarin Chinese, whereas race bias is more pronounced in English, revealing cross-lingual asymmetries that aggregate analyses obscure. In several scenarios, demographic cues produce counter-directional effects, challenging simple stereotype-amplification accounts of model behavior. These findings suggest that bias in LLM-based dispatch is not a fixed property of models alone, but arises from the interaction between demographic signals, contextual ambiguity, and language. Beyond these empirical results, the proposed framework provides a scalable audit infrastructure that enables deploying agencies to evaluate candidate models on jurisdiction-relevant scenarios prior to real-world adoption.
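As a quick arithmetic check, the 19,800 figure is consistent with one particular reading of the design. The sketch below assumes a full crossing of 11 models, 15 scenarios, 3 demographic axes, 2 languages, and 2 variants per pair, with the 10 samples per prompt stated in the simulated rebuttal on this page. The full crossing and the two-variants-per-pair factor are assumptions; the abstract does not spell out the decomposition.

```python
from math import prod

# Assumed factorization of the experiment; only the 11, 15, 3, and 2 (languages)
# come from the abstract, and the 10 comes from the simulated rebuttal.
design = {
    "models": 11,
    "scenarios": 15,
    "demographic_axes": 3,
    "languages": 2,
    "variants_per_pair": 2,    # assumption: baseline prompt plus one variant
    "samples_per_prompt": 10,  # from the simulated rebuttal, not the abstract
}

total_outputs = prod(design.values())
print(total_outputs)  # 19800
```

Other decompositions could yield the same total, so this shows consistency rather than confirming the actual design.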
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a cross-lingual audit of demographic bias in eleven large language models for AI-based emergency police dispatch. By operationalizing the Police Priority Dispatch System as a five-level ordinal classification task and employing a minimal-pair design across 15 scenario pairs in English and Mandarin Chinese, the authors analyze 19,800 model outputs. Their key findings are that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the priority is clearly determined by the call content; the magnitude is largest for religious appearance, followed by gender and race; bias patterns do not transfer consistently across languages, with gender bias amplified in Mandarin Chinese and race bias more pronounced in English; and some scenarios show counter-directional effects.
Significance. If the minimal-pair design successfully isolates demographic effects, this work would be significant for showing that LLM bias in high-stakes applications is context-dependent rather than fixed, with the interaction of ambiguity, demographic signals, and language as key factors. The experiment's scale (19,800 outputs), controlled cross-lingual setup, and scalable audit framework are clear strengths that enable falsifiable, jurisdiction-specific testing prior to deployment. These results challenge simple stereotype-amplification accounts and provide actionable patterns for public safety AI.
major comments (2)
- [§3 (Methods)] §3 (Methods), minimal-pair design: The central claim that bias emerges systematically only when severity is ambiguous (and disappears when clear) rests on the 15 scenario pairs differing solely in the demographic cue while holding all other content fixed. The manuscript provides no validation (e.g., independent annotator severity ratings, embedding cosine similarity, or lexical/pragmatic equivalence checks) that insertions such as religious descriptors or names do not alter implied severity, sentence length, or ambiguity. This is load-bearing, as LLMs are known to be sensitive to such surface variations, which could independently drive the reported output differences.
- [§4 (Results)] §4 (Results), cross-lingual asymmetries: The claim that bias 'does not transfer consistently across languages' (gender amplified in Mandarin, race in English) and the axis-specific magnitudes require explicit reporting of the bias metric (e.g., mean priority-level shift or probability difference), per-model stochasticity handling (number of samples per prompt), and statistical tests with multiple-comparison correction. Without these details, the language-asymmetric and counter-directional patterns cannot be confidently distinguished from noise or metric artifacts.
minor comments (2)
- [Abstract and §5] The abstract and §5 (Discussion) mention counter-directional effects but would benefit from a dedicated table or figure excerpting 2-3 concrete scenario pairs with model outputs to illustrate the phenomenon for readers.
- [§5 (Discussion)] The manuscript should add a brief limitations subsection addressing how well the synthetic, single-turn prompts generalize to real emergency calls, which often include disfluencies and incomplete information.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which identify key opportunities to strengthen the methodological validation and statistical transparency of our cross-lingual audit. We address each major comment below and will make the corresponding revisions to the manuscript.
Point-by-point responses
Referee: [§3 (Methods)] §3 (Methods), minimal-pair design: The central claim that bias emerges systematically only when severity is ambiguous (and disappears when clear) rests on the 15 scenario pairs differing solely in the demographic cue while holding all other content fixed. The manuscript provides no validation (e.g., independent annotator severity ratings, embedding cosine similarity, or lexical/pragmatic equivalence checks) that insertions such as religious descriptors or names do not alter implied severity, sentence length, or ambiguity. This is load-bearing, as LLMs are known to be sensitive to such surface variations, which could independently drive the reported output differences.
Authors: We agree that explicit validation of the minimal-pair design is essential to ensure demographic insertions do not inadvertently affect perceived severity or ambiguity. Although the pairs were constructed by holding all non-demographic content fixed, the original submission did not include independent checks. In the revised manuscript we will add: (1) blind severity ratings from three independent annotators on a 20% random sample of pairs, confirming no significant differences in implied priority; (2) sentence embedding cosine similarities between each minimal pair; and (3) lexical and length overlap metrics. These additions will directly support the claim that priority shifts arise from demographic cues rather than surface confounds. revision: yes
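The surface-equivalence checks promised here could look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' procedure: the function names are hypothetical, tokenization is plain whitespace splitting (adequate for English only; Mandarin text would need a word segmenter), and `cosine_similarity` expects embedding vectors that have already been computed by whatever sentence-embedding model is chosen.

```python
import math


def tokenize(text):
    """Lowercased whitespace tokens; an English-only simplification."""
    return text.lower().split()


def jaccard_overlap(a, b):
    """Share of distinct tokens common to both prompts (1.0 means identical sets)."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def length_difference(a, b):
    """Token-count difference, variant minus baseline."""
    return len(tokenize(b)) - len(tokenize(a))


def cosine_similarity(u, v):
    """Cosine similarity of two precomputed embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Hypothetical prompts, invented for illustration.
baseline = "a man is standing outside the shop"
variant = "a woman is standing outside the shop"
print(jaccard_overlap(baseline, variant), length_difference(baseline, variant))
```

High overlap and zero length difference show only that the surface forms match. They do not show that implied severity is unchanged, which is why the annotator ratings are listed as a separate check.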
Referee: [§4 (Results)] §4 (Results), cross-lingual asymmetries: The claim that bias 'does not transfer consistently across languages' (gender amplified in Mandarin, race in English) and the axis-specific magnitudes require explicit reporting of the bias metric (e.g., mean priority-level shift or probability difference), per-model stochasticity handling (number of samples per prompt), and statistical tests with multiple-comparison correction. Without these details, the language-asymmetric and counter-directional patterns cannot be confidently distinguished from noise or metric artifacts.
Authors: We concur that precise reporting of the bias metric, sampling procedure, and inferential statistics is required to substantiate the cross-lingual and counter-directional findings. The revised manuscript will explicitly define the primary bias metric as the mean shift in the five-level priority score between demographic variants; state that 10 independent samples were drawn per prompt to account for sampling stochasticity; report effect sizes with standard errors; and apply Wilcoxon signed-rank tests with Bonferroni correction across the 90 comparisons (15 scenarios × 3 axes × 2 languages). These additions will allow readers to evaluate the reliability of the reported asymmetries. revision: yes
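The metric and correction described in this response can be illustrated with a short sketch. It rests on assumptions the rebuttal does not settle: the function names are hypothetical, the scores are treated as paired observations (for example one mean priority per model for each variant), and the example numbers are invented. The Wilcoxon signed-rank test itself is not implemented here; only the mean shift, its standard error, and the Bonferroni-adjusted threshold are shown.

```python
from math import sqrt
from statistics import mean, stdev


def mean_priority_shift(baseline_scores, variant_scores):
    """Mean paired difference (variant minus baseline) and its standard error."""
    if len(baseline_scores) != len(variant_scores) or len(baseline_scores) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [v - b for b, v in zip(baseline_scores, variant_scores)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))


def bonferroni_alpha(alpha, n_comparisons):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


# Invented priority scores on the five-level scale, for illustration only.
baseline_scores = [3, 3, 3, 3]
variant_scores = [2, 3, 2, 3]
shift, se = mean_priority_shift(baseline_scores, variant_scores)

n_comparisons = 15 * 3 * 2  # scenarios x axes x languages, as stated above
print(shift, round(se, 4), bonferroni_alpha(0.05, n_comparisons))
```

With 90 comparisons the per-test threshold drops to roughly 0.00056, so small per-scenario shifts would need many paired observations to clear it.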
Circularity Check
No circularity: purely empirical measurement study
Full rationale
The paper conducts a direct empirical audit by generating 19,800 model outputs on controlled minimal-pair prompts across 11 LLMs, 15 scenarios, three demographic axes, and two languages, then measures differences in five-level priority classifications. No equations, derivations, fitted parameters, or self-citation chains are present that reduce any reported bias pattern to a self-referential quantity by construction. The central claims (bias emerges under ambiguity, varies by axis and language, does not transfer consistently) rest on observable output statistics from the synthetic prompts rather than any internal reduction or imported uniqueness theorem. This is a standard measurement study whose validity can be assessed against external benchmarks such as real call data or alternative prompt controls, with no load-bearing step that collapses to its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Minimal-pair design isolates the causal effect of demographic cues on model outputs
- domain assumption: Synthetic scenarios are representative of real emergency calls for bias measurement