arxiv: 2605.03998 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.CY

EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

Pith reviewed 2026-05-07 00:40 UTC · model grok-4.3

keywords: gender, models, counterfactual, deepseek, calibration, directional, equitriage, fairness

The pith

Five LLMs all exceed a 5 percent gender-flip threshold on ESI triage vignettes; blinding reduces flips for some models while directional undertriage and calibration gaps dissociate in others.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Researchers took real emergency-department notes from MIMIC-IV-ED, created paired versions that differed only in the patient's stated gender, and asked five large language models to assign the Emergency Severity Index (ESI) score that hospitals use to rank urgency. Every model changed its answer in more than 5% of pairs when gender was swapped (9.9% to 43.8%). Two models, DeepSeek and Gemini, systematically gave lower urgency to female versions; two were near parity, and one flipped often with a weak asymmetry in the male direction. Removing names and gender words from the prompt cut Gemini's flip rate to 0.5%, but a blinded variant that kept age left DeepSeek with a residual female-to-male ratio of 1.25, suggesting that age still carried information. Asking the models to reason step by step made all five less accurate. The audit shows that three common fairness checks (whether groups get the same average score, whether swapping gender changes the score, and whether the score predicts real outcomes equally well) do not always agree, so fixing one does not fix the others.
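A toy illustration of why the first two checks can disagree may help. The sketch below uses invented paired scores, not data from the paper, and the function names `mean_score_gap` and `flip_rate` are hypothetical. It shows a case where both groups receive the same average ESI score while half of the pairs still change when gender is swapped.

```python
from statistics import mean

# Invented paired ESI scores (1 = most urgent, 5 = least urgent).
# Each tuple is (score for the female version, score for the male version).
pairs = [(3, 2), (2, 3), (3, 3), (2, 2)]


def mean_score_gap(pairs):
    """Group parity check: difference in mean score between the two groups."""
    return mean(f for f, _ in pairs) - mean(m for _, m in pairs)


def flip_rate(pairs):
    """Counterfactual invariance check: share of pairs whose score changes."""
    return sum(f != m for f, m in pairs) / len(pairs)


print(mean_score_gap(pairs))  # 0.0
print(flip_rate(pairs))       # 0.5
```

The two flips point in opposite directions, so they cancel in the group averages. A parity-only audit would report no problem on this toy data even though individual patients are scored differently by gender.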

Core claim

All five models produced flip rates above a pre-registered 5% threshold (9.9% to 43.8%); demographic blinding reduced Gemini's flip rate to 0.5% while an age-preserving blind variant left DeepSeek with residual F/M 1.25.
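The flip rate and the F/M figure can be sketched as follows. This is a minimal reading of the metrics, not the authors' code: it assumes the F/M ratio counts pairs where the female version received the less urgent score against pairs where the male version did, and the function name `audit_pairs` and the toy data are invented for illustration.

```python
THRESHOLD = 0.05  # the 5% flip threshold stated in the source


def audit_pairs(pairs, threshold=THRESHOLD):
    """Summarise paired ESI scores given as (esi_female, esi_male).

    ESI runs from 1 (most urgent) to 5 (least urgent), so a larger
    number for one version means that version was rated less urgent.
    """
    n = len(pairs)
    female_under = sum(f > m for f, m in pairs)  # female version less urgent
    male_under = sum(m > f for f, m in pairs)    # male version less urgent
    rate = (female_under + male_under) / n
    # Assumed definition of the directional ratio; undefined cases map to inf.
    ratio = female_under / male_under if male_under else float("inf")
    return {
        "flip_rate": rate,
        "f_to_m_ratio": ratio,
        "exceeds_threshold": rate > threshold,
    }


# Invented data: 16 unchanged pairs, 3 female-undertriage flips, 1 male.
toy = [(3, 3)] * 16 + [(3, 2)] * 3 + [(2, 3)] * 1
result = audit_pairs(toy)
print(result)
```

On the toy data the flip rate is 4 of 20 pairs and the directional ratio is 3 to 1. Reporting both matters because a model can exceed the flip threshold while its flips are balanced between directions.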

Load-bearing premise

That gender-swapped counterfactual vignettes created from MIMIC-IV-ED notes preserve all clinically relevant information and that observed flips therefore reflect model bias rather than prompt artifacts or incomplete clinical context.
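To make the premise concrete, here is a rough sketch of how a gender swap might be generated and where it can fail. It is not the paper's pipeline: the swap table, the `SEX_SPECIFIC` term list, and both function names are hypothetical, and the pronoun handling is deliberately naive.

```python
import re

# Hypothetical, deliberately small swap table. Mapping "her" to "his" is
# wrong when "her" is an object pronoun ("gave her aspirin").
SWAP = {
    "female": "male", "male": "female",
    "woman": "man", "man": "woman",
    "she": "he", "he": "she",
    "her": "his", "his": "her",
}
PATTERN = re.compile(r"\b(" + "|".join(SWAP) + r")\b", re.IGNORECASE)

# Hypothetical terms for which a swap is not clinically neutral.
SEX_SPECIFIC = {"pregnant", "pregnancy", "prostate", "ovarian"}


def swap_gender(text):
    """Replace each gendered token with its counterpart, keeping initial case."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(repl, text)


def needs_review(text):
    """Flag notes where swapping gender would change the clinical picture."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & SEX_SPECIFIC)


note = "45 year old female with chest pain. She reports her pain began at rest."
print(swap_gender(note))
print(needs_review("28 year old female, 10 weeks pregnant, abdominal pain"))
```

A swap like this changes only surface tokens. Whether the result is a valid counterfactual depends on whether anything else in the note is tied to sex, which is exactly what the premise assumes away.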

Original abstract

Emergency department triage assigns patients an acuity score that determines treatment priority, and clinical evidence documents persistent gender disparities in human acuity assessment. As hospitals pilot large language models (LLMs) as triage decision support, a critical question is whether these models reproduce or mitigate known biases. We present EQUITRIAGE, a fairness audit of LLM-based ESI assignment evaluating five models (Gemini-3-Flash, Nemotron-3-Super, DeepSeek-V3.1, Mistral-Small-3.2, GPT-4.1-Nano) across 374,275 evaluations on 18,714 MIMIC-IV-ED vignettes under four prompt strategies. Of 9,368 originals, 9,346 are paired with a gender-swapped counterfactual. All five models produced flip rates above a pre-registered 5% threshold (9.9% to 43.8%). Two showed directional female undertriage (DeepSeek F/M 2.15:1, Gemini 1.34:1); two were near-parity; one had high sensitivity with weak male-direction asymmetry. DeepSeek's directional bias coexisted with a low outcome-linked calibration gap (0.013 against MIMIC-IV admission), a Chouldechova-style dissociation between within-group calibration and between-pair counterfactual invariance. Demographic blinding reduced Gemini's flip rate to 0.5%; an age-preserving blind variant left DeepSeek with residual F/M 1.25, implicating age as a residual channel. Chain-of-thought prompting degraded accuracy for all five models. A two-model ablation reveals opposite underlying mechanisms for the same directional phenotype: in Gemini the signal is emergent in the combined name+gender swap, while in DeepSeek the gender token alone carries it. EQUITRIAGE shows that group parity, counterfactual invariance, and gender calibration are distinct fairness properties, that intervention effectiveness is model-dependent, and that per-model counterfactual auditing should precede clinical deployment.
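The counts in the abstract can be cross-checked with simple arithmetic. The sketch below assumes a full grid in which every vignette is scored once by each of the five models under each of the four prompt strategies; the abstract does not state that design, so treat the comparison as indicative only.

```python
originals = 9_368
counterfactuals = 9_346
models = 5
strategies = 4
reported_evaluations = 374_275

vignettes = originals + counterfactuals
full_grid = vignettes * models * strategies  # assumed full factorial design

print(vignettes)                         # 18714
print(full_grid)                         # 374280
print(full_grid - reported_evaluations)  # 5
print(originals - counterfactuals)       # 22
```

The vignette total matches the abstract. Under the full-grid assumption the reported evaluation count is 5 short, and 22 originals have no counterfactual; the abstract does not explain either difference.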

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The audit rests on the assumption that gender-swapped vignettes are valid counterfactuals and that ESI labels plus admission outcomes constitute appropriate ground truth; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Gender-swapped vignettes preserve all clinically relevant information.
    Invoked when interpreting flip rates as bias rather than as prompt artifacts.
  • domain assumption: MIMIC-IV admission is a suitable proxy for true acuity.
    Used to compute calibration gaps.
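The second axiom is easier to evaluate with a concrete calibration measure in hand. The source does not define its calibration gap, so the sketch below uses one assumed definition: the size-weighted mean absolute difference in admission rate between female and male patients who received the same ESI score. The function name `calibration_gap` and the records are invented.

```python
from collections import defaultdict


def calibration_gap(records):
    """Assumed gap: weighted mean |P(admit | ESI, F) - P(admit | ESI, M)|.

    records: iterable of (gender, esi, admitted), gender in {"F", "M"}.
    """
    # esi -> gender -> [admitted count, total count]
    counts = defaultdict(lambda: {"F": [0, 0], "M": [0, 0]})
    for gender, esi, admitted in records:
        cell = counts[esi][gender]
        cell[0] += int(admitted)
        cell[1] += 1

    weighted, total = 0.0, 0
    for by_gender in counts.values():
        (f_adm, f_n), (m_adm, m_n) = by_gender["F"], by_gender["M"]
        if f_n == 0 or m_n == 0:
            continue  # skip ESI levels that lack one of the groups
        n = f_n + m_n
        weighted += n * abs(f_adm / f_n - m_adm / m_n)
        total += n
    return weighted / total if total else 0.0


# Invented records: at ESI 2, 3 of 4 women and 2 of 4 men are admitted;
# at ESI 3, 1 of 4 in each group is admitted.
records = (
    [("F", 2, True)] * 3 + [("F", 2, False)]
    + [("M", 2, True)] * 2 + [("M", 2, False)] * 2
    + [("F", 3, True)] + [("F", 3, False)] * 3
    + [("M", 3, True)] + [("M", 3, False)] * 3
)
print(calibration_gap(records))  # 0.125
```

Any measure of this kind inherits the axiom: if admission decisions are themselves affected by gender, a small gap shows agreement with those decisions rather than agreement with true acuity.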

pith-pipeline@v0.9.0 · 5643 in / 1359 out tokens · 19395 ms · 2026-05-07T00:40:38.349621+00:00 · methodology
