Auditing demographic bias in AI-based emergency police dispatch: a cross-lingual evaluation of eleven large language models

Bertan Ucar; José O. Gomes; Pierrick Bougault; Vitor D. de Moura; Wei Zhang; William Guey; Yi Wang

arxiv: 2605.01451 · v1 · submitted 2026-05-02 · 💻 cs.CL

William Guey, Wei Zhang, Pierrick Bougault, Yi Wang, Bertan Ucar, Vitor D. de Moura, José O. Gomes

Pith reviewed 2026-05-09 14:21 UTC · model grok-4.3

keywords: demographic bias · large language models · emergency dispatch · police priority · cross-lingual evaluation · AI fairness · minimal pair design · ambiguity in classification

The pith

Demographic bias in large language models for emergency police dispatch appears when incident severity is ambiguous but disappears when priorities are clear from the call.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether eleven large language models assign different police dispatch priorities to emergency calls based on demographic details such as religious appearance, gender, or race. Its framework recasts the Police Priority Dispatch System as a five-level classification task and compares nearly identical scenarios that differ only in a demographic cue. Bias appears mainly when the call leaves severity unclear; when the priority is obvious from the facts, models mostly ignore demographics. The size of the bias also depends on which demographic group is mentioned and on the language of the call: gender bias is amplified in Mandarin Chinese, while race bias is more pronounced in English. Bias is therefore not just a fixed property of a model but depends on how much room the situation leaves for interpretation.

Core claim

When large language models are asked to classify emergency calls into five priority levels for police dispatch, they assign different priorities to otherwise identical calls that differ only in demographic descriptors such as religious appearance, gender, or race. This effect occurs systematically in scenarios where the incident severity is not clearly indicated by the call content, but the bias largely vanishes in scenarios where the operational priority can be directly determined from the facts provided. The magnitude of bias is greatest for religious appearance cues, followed by gender and then race, and these patterns do not hold consistently across English and Mandarin Chinese versions.

What carries the argument

A cross-lingual audit framework that applies a minimal-pair design to isolate demographic cues within a five-level ordinal classification task based on the Police Priority Dispatch System.
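A minimal-pair design of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's code: the scenario template, the cue pairs, and the function names are all hypothetical, and the paper's actual prompts are not reproduced here.

```python
# Hypothetical scenario template; the slot {person} is the only part that varies.
TEMPLATE = (
    "Caller reports {person} standing outside a closed shop at night, "
    "looking into the windows."
)

# Hypothetical cue pairs for one demographic axis (illustrative only).
CUES = {
    "gender": ("a man", "a woman"),
}


def build_minimal_pair(template: str, cue_a: str, cue_b: str) -> tuple[str, str]:
    """Fill one template twice so the two prompts differ only in the cue."""
    return template.format(person=cue_a), template.format(person=cue_b)


def differing_tokens(a: str, b: str) -> tuple[list[str], list[str]]:
    """Return the whitespace tokens that appear in one prompt but not the other."""
    tokens_a, tokens_b = a.split(), b.split()
    only_a = [t for t in tokens_a if t not in tokens_b]
    only_b = [t for t in tokens_b if t not in tokens_a]
    return only_a, only_b


pairs = {axis: build_minimal_pair(TEMPLATE, *cues) for axis, cues in CUES.items()}
print(differing_tokens(*pairs["gender"]))
```

Checking the differing tokens is a cheap guard that a pair really is minimal; it says nothing about whether the swapped cue changes implied severity, which needs separate validation.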

If this is right

Bias risk is concentrated in calls with ambiguous severity, so agencies using LLMs for dispatch should focus their audits on those calls.
Separate evaluations are needed for each language since bias patterns differ between English and Chinese.
Religious appearance cues warrant closer scrutiny than gender or race in model audits.
Some scenarios produce counter-directional bias effects that challenge simple stereotype explanations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deploying agencies could use this framework to test models on their specific local scenarios before adoption.
Training or prompting techniques that reduce sensitivity to ambiguity might lower bias in high-stakes applications.
Similar audits could be applied to other AI systems in public safety or healthcare where decisions depend on incomplete information.

Load-bearing premise

The minimal-pair scenarios isolate demographic cues without introducing other differences in wording or implied severity, and the synthetic prompts reflect how models would behave on actual emergency calls.
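One way to probe this premise is with the surface-equivalence checks that the referee report names (lexical overlap, length, and embedding similarity). The sketch below is a rough Python illustration, assuming whitespace tokenization and using a bag-of-words cosine as a crude stand-in for sentence-embedding cosine similarity. The function names and example sentences are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenization (an assumption; real checks may differ)."""
    return text.lower().split()


def jaccard_overlap(a: str, b: str) -> float:
    """Share of distinct tokens common to both prompts."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    return len(set_a & set_b) / len(set_a | set_b)


def length_difference(a: str, b: str) -> int:
    """Absolute difference in token count between the two prompts."""
    return abs(len(tokenize(a)) - len(tokenize(b)))


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of token-count vectors (stand-in for embedding cosine)."""
    count_a, count_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count_a[t] * count_b[t] for t in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical minimal pair.
variant_a = "a man is shouting outside the store"
variant_b = "a woman is shouting outside the store"

print(jaccard_overlap(variant_a, variant_b))
print(length_difference(variant_a, variant_b))
print(bow_cosine(variant_a, variant_b))
```

High overlap and zero length difference only rule out gross surface confounds; they cannot show that the two variants imply the same severity, which is why the referee also asks for independent annotator ratings.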

What would settle it

The claim would be undermined if analysis of real emergency call transcripts showed no difference in dispatch recommendations between ambiguous cases with and without demographic details, or if bias appeared equally in clear-priority cases.

Original abstract

Large language models (LLMs) are rapidly being integrated into high-stakes public safety systems, including emergency call triage and dispatch decision support, yet their demographic fairness in this context remains largely untested. Here we introduce a cross-lingual audit framework that operationalizes the Police Priority Dispatch System as a five-level ordinal classification task and applies a controlled minimal-pair design to isolate the effect of demographic cues. Across 19,800 model outputs spanning 11 frontier models, 15 scenario pairs, three demographic categories (religious appearance, gender, and race), and two languages (English and Mandarin Chinese), we find that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the operational priority is clearly determined by call content. Bias magnitude varies by demographic axis, with the largest effects observed for religious appearance, followed by gender and race. Critically, bias does not transfer consistently across languages: gender bias is substantially amplified in Mandarin Chinese, whereas race bias is more pronounced in English, revealing cross-lingual asymmetries that aggregate analyses obscure. In several scenarios, demographic cues produce counter-directional effects, challenging simple stereotype-amplification accounts of model behavior. These findings suggest that bias in LLM-based dispatch is not a fixed property of models alone, but arises from the interaction between demographic signals, contextual ambiguity, and language. Beyond these empirical results, the proposed framework provides a scalable audit infrastructure that enables deploying agencies to evaluate candidate models on jurisdiction-relevant scenarios prior to real-world adoption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This audit shows LLMs produce demographic biases in police dispatch triage mainly under ambiguous severity, with language-specific patterns that don't carry over consistently between English and Mandarin.

The main thing to know is that the authors ran a large-scale minimal-pair experiment on 11 LLMs, turning the Police Priority Dispatch System into a five-level ordinal task. They generated 19,800 outputs across 15 scenario pairs, three demographic axes, and two languages. The central pattern is that bias shows up when call content leaves severity unclear but drops away once the priority is obvious from the facts alone. Religious appearance cues produced the largest shifts, followed by gender and race, and some cases flipped direction instead of following simple stereotypes. Gender effects were stronger in Mandarin while race effects were stronger in English, which is the kind of asymmetry that single-language studies miss.
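The reported total of 19,800 outputs can be checked against the stated design with a short calculation. The factorization below is only one decomposition consistent with the numbers. It assumes two variants per scenario pair, every pair instantiated on every axis, and 10 samples per prompt; the sample count appears only in the simulated rebuttal further down, not in the abstract.

```python
# Counts stated in the abstract.
n_models = 11
n_scenario_pairs = 15
n_axes = 3
n_languages = 2
reported_total = 19_800

# Assumptions, not confirmed by the abstract:
variants_per_pair = 2    # the two sides of each minimal pair
samples_per_prompt = 10  # figure taken from the simulated rebuttal

unique_prompts = n_scenario_pairs * n_axes * n_languages * variants_per_pair
total_outputs = n_models * unique_prompts * samples_per_prompt

print(unique_prompts, total_outputs)  # 180 19800
```

Under these assumptions the product matches the reported total exactly, which suggests the design is fully crossed, but other factorizations (for example, five pairs per axis with 30 samples each) would give the same figure.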

Referee Report

2 major / 2 minor

Summary. The paper claims to have conducted a comprehensive cross-lingual audit of demographic bias in eleven large language models for AI-based emergency police dispatch. By operationalizing the Police Priority Dispatch System as a five-level ordinal classification task and employing a minimal-pair design across 15 scenario pairs in English and Mandarin Chinese, the authors analyze 19,800 model outputs. Their key findings are that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the priority is clearly determined by the call content; the magnitude is largest for religious appearance, followed by gender and race; bias patterns do not transfer consistently across languages, with gender bias amplified in Chinese and race bias in English; and some scenarios show counter-directional effects.

Significance. If the minimal-pair design successfully isolates demographic effects, this work would be significant for showing that LLM bias in high-stakes applications is context-dependent rather than fixed, with the interaction of ambiguity, demographic signals, and language as key factors. The experiment's scale (19,800 outputs), controlled cross-lingual setup, and scalable audit framework are clear strengths that enable falsifiable, jurisdiction-specific testing prior to deployment. These results challenge simple stereotype-amplification accounts and provide actionable patterns for public safety AI.

major comments (2)

§3 (Methods), minimal-pair design: The central claim that bias emerges systematically only when severity is ambiguous (and disappears when clear) rests on the 15 scenario pairs differing solely in the demographic cue while holding all other content fixed. The manuscript provides no validation (e.g., independent annotator severity ratings, embedding cosine similarity, or lexical/pragmatic equivalence checks) that insertions such as religious descriptors or names do not alter implied severity, sentence length, or ambiguity. This is load-bearing, as LLMs are known to be sensitive to such surface variations, which could independently drive the reported output differences.
§4 (Results), cross-lingual asymmetries: The claim that bias 'does not transfer consistently across languages' (gender amplified in Mandarin, race in English) and the axis-specific magnitudes require explicit reporting of the bias metric (e.g., mean priority-level shift or probability difference), per-model stochasticity handling (number of samples per prompt), and statistical tests with multiple-comparison correction. Without these details, the language-asymmetric and counter-directional patterns cannot be confidently distinguished from noise or metric artifacts.
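The "mean priority-level shift" that the referee mentions as a candidate metric could look like the Python sketch below. It assumes priority levels are coded as integers 1 to 5 and that each variant of a prompt is sampled several times. The function name and the sample data are hypothetical, and the source does not say which end of the scale is most urgent, so the sign of the shift has no fixed interpretation here.

```python
from statistics import mean


def mean_priority_shift(variant_a: list[int], variant_b: list[int]) -> float:
    """Mean level for variant A minus mean level for variant B (1-5 scale)."""
    if not all(1 <= level <= 5 for level in (*variant_a, *variant_b)):
        raise ValueError("priority levels must be integers from 1 to 5")
    return mean(variant_a) - mean(variant_b)


# Hypothetical repeated samples for the two sides of one minimal pair.
samples_a = [2, 2, 3, 2, 2, 3, 2, 2, 2, 3]
samples_b = [3, 3, 3, 2, 3, 3, 3, 3, 3, 3]

shift = mean_priority_shift(samples_a, samples_b)
print(round(shift, 2))
```

Treating ordinal levels as numbers is itself a modeling choice; a probability-difference metric, the referee's other suggestion, would avoid assuming equal spacing between levels.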

minor comments (2)

Abstract and §5 (Discussion): Both mention counter-directional effects but would benefit from a dedicated table or figure excerpting 2-3 concrete scenario pairs with model outputs to illustrate the phenomenon for readers.
§5 (Discussion): The manuscript should add a brief limitations subsection addressing how well the synthetic, single-turn prompts generalize to real emergency calls, which often include disfluencies and incomplete information.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which identify key opportunities to strengthen the methodological validation and statistical transparency of our cross-lingual audit. We address each major comment below and will make the corresponding revisions to the manuscript.

Referee: §3 (Methods), minimal-pair design: The central claim that bias emerges systematically only when severity is ambiguous (and disappears when clear) rests on the 15 scenario pairs differing solely in the demographic cue while holding all other content fixed. The manuscript provides no validation (e.g., independent annotator severity ratings, embedding cosine similarity, or lexical/pragmatic equivalence checks) that insertions such as religious descriptors or names do not alter implied severity, sentence length, or ambiguity. This is load-bearing, as LLMs are known to be sensitive to such surface variations, which could independently drive the reported output differences.

Authors: We agree that explicit validation of the minimal-pair design is essential to ensure demographic insertions do not inadvertently affect perceived severity or ambiguity. Although the pairs were constructed by holding all non-demographic content fixed, the original submission did not include independent checks. In the revised manuscript we will add: (1) blind severity ratings from three independent annotators on a 20% random sample of pairs, confirming no significant differences in implied priority; (2) sentence embedding cosine similarities between each minimal pair; and (3) lexical and length overlap metrics. These additions will directly support the claim that priority shifts arise from demographic cues rather than surface confounds. (Revision: yes)
Referee: §4 (Results), cross-lingual asymmetries: The claim that bias 'does not transfer consistently across languages' (gender amplified in Mandarin, race in English) and the axis-specific magnitudes require explicit reporting of the bias metric (e.g., mean priority-level shift or probability difference), per-model stochasticity handling (number of samples per prompt), and statistical tests with multiple-comparison correction. Without these details, the language-asymmetric and counter-directional patterns cannot be confidently distinguished from noise or metric artifacts.

Authors: We concur that precise reporting of the bias metric, sampling procedure, and inferential statistics is required to substantiate the cross-lingual and counter-directional findings. The revised manuscript will explicitly define the primary bias metric as the mean shift in the five-level priority score between demographic variants; state that 10 independent samples were drawn per prompt to mitigate stochasticity; report effect sizes with standard errors; and apply Wilcoxon signed-rank tests with Bonferroni correction across the 15 scenarios × 3 axes × 2 languages comparisons. These additions will allow readers to evaluate the reliability of the reported asymmetries. (Revision: yes)
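The testing procedure named in the second response (Wilcoxon signed-rank with Bonferroni correction over 15 × 3 × 2 = 90 comparisons) can be sketched in pure Python. This is an illustration under assumptions, not the authors' analysis: it pairs per-model mean priorities for the two variants of one scenario, computes an exact two-sided p-value by enumerating sign assignments, and drops zero differences (one common convention). All data and function names are hypothetical.

```python
from itertools import product


def average_ranks(values: list[float]) -> list[float]:
    """Rank values ascending, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_exact(x: list[float], y: list[float]) -> float:
    """Exact two-sided Wilcoxon signed-rank p-value; zero differences dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    half_total = sum(ranks) / 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - half_total)
    # Under the null every sign pattern is equally likely: 2**n cases (small n only).
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - half_total) >= observed - 1e-12:
            hits += 1
    return hits / 2 ** len(ranks)


def bonferroni(p_values: list[float], n_comparisons: int) -> list[float]:
    """Multiply each p-value by the number of comparisons, capped at 1."""
    return [min(1.0, p * n_comparisons) for p in p_values]


# Hypothetical per-model mean priorities (11 models) for one scenario pair.
variant_a = [2.1, 2.4, 2.0, 3.0, 2.2, 2.8, 2.5, 2.0, 2.9, 2.3, 2.6]
variant_b = [2.6, 2.9, 2.0, 3.1, 2.9, 3.0, 3.1, 2.4, 2.8, 3.0, 3.2]

p_raw = wilcoxon_exact(variant_a, variant_b)
p_adjusted = bonferroni([p_raw], n_comparisons=15 * 3 * 2)[0]
print(p_raw, p_adjusted)
```

With only 11 models per comparison the smallest attainable exact two-sided p-value is 2/2048, so after a 90-fold Bonferroni correction few comparisons can reach conventional significance thresholds; that is a property of this sketch's pairing choice, and the authors may pair observations differently.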

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

The paper conducts a direct empirical audit by generating 19,800 model outputs on controlled minimal-pair prompts across 11 LLMs, 15 scenarios, three demographic axes, and two languages, then measures differences in five-level priority classifications. No equations, derivations, fitted parameters, or self-citation chains are present that reduce any reported bias pattern to a self-referential quantity by construction. The central claims (bias emerges under ambiguity, varies by axis and language, does not transfer consistently) rest on observable output statistics from the synthetic prompts rather than any internal reduction or imported uniqueness theorem. This is a standard measurement study whose validity can be assessed against external benchmarks such as real call data or alternative prompt controls, with no load-bearing step that collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard experimental design assumptions for fairness audits and the validity of operationalizing dispatch priority as a five-level ordinal classification task; no free parameters, ad-hoc axioms, or new invented entities are introduced.

axioms (2)

domain assumption: Minimal-pair design isolates the causal effect of demographic cues on model outputs.
Invoked when attributing output differences solely to the changed demographic variable.
domain assumption: Synthetic scenarios are representative of real emergency calls for bias measurement.
Required to generalize findings to operational dispatch systems.

pith-pipeline@v0.9.0 · 5594 in / 1347 out tokens · 47476 ms · 2026-05-09T14:21:23.863904+00:00 · methodology
