Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency
Pith reviewed 2026-06-28 09:47 UTC · model grok-4.3
The pith
LLMs assign lower emergency urgency to identical neurological symptoms when the patient is described as a young woman.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When presented with an identical symptom profile, the models preferentially diagnose young women with Idiopathic Intracranial Hypertension and route them to outpatient appointments. Men are instead diagnosed with increased intracranial pressure, with space-occupying lesions kept in the differential, and are referred to the ER at much higher rates. ER referral rates for young women versus young men are 0% versus 23.3% (Gemini), 6.7% versus 96.7% (Claude), and 6.7% versus 66.7% (GPT), all reported as significant at p < 0.001. The disparity vanishes at age 65.
What carries the argument
Diagnostic substitution: the model anchors on a gender-associated diagnosis (Idiopathic Intracranial Hypertension for women of childbearing age) instead of a neutral differential (increased intracranial pressure with a possible space-occupying lesion), and so assigns lower urgency despite equivalent severity scores.
If this is right
- Clinical LLMs replicate documented human biases by routing patients to different care levels according to demographic priors.
- Urgency assessment in AI triage must be decoupled from probabilistic diagnostic anchors to avoid systematic under-triage.
- The effect is age-specific and disappears at 65, indicating the bias tracks reproductive-age stereotypes.
- Releasing all prompts, code, and raw outputs enables direct replication and targeted mitigation testing.
Where Pith is reading between the lines
- The same substitution pattern may appear with other symptom sets that carry strong demographic associations, such as atypical chest pain or abdominal pain.
- Forcing models to output urgency ratings before any diagnosis could test whether the bias originates in the diagnostic step itself.
- Real-world deployment would require ongoing stratified monitoring of referral rates by age and gender to detect similar gaps.
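The stratified monitoring suggested in the last point could look roughly like the sketch below. It is only an illustration: the record fields (`age_group`, `gender`, `disposition`), the 10-point alert threshold, and the synthetic log are all assumptions introduced here, not anything taken from the paper or its released code.

```python
from collections import defaultdict

def er_rates(records):
    """Return {(age_group, gender): ER referral rate} from triage records."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [er_referrals, total]
    for rec in records:
        key = (rec["age_group"], rec["gender"])
        counts[key][0] += rec["disposition"] == "ER"
        counts[key][1] += 1
    return {key: er / total for key, (er, total) in counts.items()}

def gender_gaps(rates, threshold=0.10):
    """Return {age_group: male_rate - female_rate} for gaps above threshold."""
    gaps = {}
    for age in {age for age, _ in rates}:
        if (age, "male") in rates and (age, "female") in rates:
            gap = rates[(age, "male")] - rates[(age, "female")]
            if abs(gap) > threshold:
                gaps[age] = gap
    return gaps

def make(age_group, gender, disposition, n):
    """Build n identical synthetic records (illustration only)."""
    return [{"age_group": age_group, "gender": gender, "disposition": disposition}] * n

# Synthetic log; these numbers are NOT the paper's data.
log = (
    make("18-44", "female", "ER", 1) + make("18-44", "female", "outpatient", 9)
    + make("18-44", "male", "ER", 7) + make("18-44", "male", "outpatient", 3)
    + make("65+", "female", "ER", 8) + make("65+", "female", "outpatient", 2)
    + make("65+", "male", "ER", 8) + make("65+", "male", "outpatient", 2)
)

rates = er_rates(log)
flagged = gender_gaps(rates)
print(flagged)
```

Stratifying by age group before comparing genders matters here because the reported effect is age-specific: a pooled male-versus-female comparison would dilute a gap that exists only in the younger strata.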
Load-bearing premise
The measured difference in referral rates arises specifically from the models selecting different diagnoses on the basis of gender and age rather than from uncontrolled differences in prompt wording or output parsing.
What would settle it
Re-run the identical symptom prompts with an explicit instruction to ignore gender-linked epidemiological priors and to produce the same diagnostic label for both genders. If diagnostic substitution is the operative mechanism, the urgency gap should disappear.
Original abstract
We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families--Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini--we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%, all p < 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender-associated diagnosis, preferentially classifying young women with Idiopathic Intracranial Hypertension (IIH)--a condition epidemiologically linked to women of childbearing age--while diagnosing men with generic increased intracranial pressure with space-occupying lesions in the differential. This diagnostic closure routes female patients to lower-urgency care (outpatient doctor appointments) despite comparable severity ratings (7-9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. We release all code, prompts, and raw results.
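The design arithmetic in the abstract (seven conditions, 30 trials each, three models, 630 trials) can be checked with a short sketch. The prompt template below is hypothetical; the actual wording is in the authors' released prompts. The sketch also shows one way to confirm that prompts differ only in the patient descriptor, which is the property the load-bearing premise depends on.

```python
from itertools import product

AGES = (25, 38, 65)
GENDERS = ("male", "female")
MODELS = ("Gemini 3.5 Flash", "Claude Sonnet 4.6", "GPT-5.4-mini")
TRIALS_PER_CELL = 30

SYMPTOMS = "persistent headache, blurred vision, morning nausea, visual disturbances"

# Hypothetical wording, used only to illustrate the design.
TEMPLATE = "A {patient} reports the following symptoms: {symptoms}. Recommend a level of care."

def build_conditions():
    """Return {condition_label: prompt}; only the patient descriptor varies."""
    conditions = {"unspecified": TEMPLATE.format(patient="patient", symptoms=SYMPTOMS)}
    for age, gender in product(AGES, GENDERS):
        descriptor = f"{age}-year-old {gender} patient"
        conditions[f"{gender}-{age}"] = TEMPLATE.format(patient=descriptor, symptoms=SYMPTOMS)
    return conditions

def differs_only_in_descriptor(conditions):
    """Check that every prompt shares the same text around the descriptor."""
    prefix, suffix = TEMPLATE.replace("{symptoms}", SYMPTOMS).split("{patient}")
    return all(p.startswith(prefix) and p.endswith(suffix) for p in conditions.values())

conditions = build_conditions()
total_trials = len(conditions) * len(MODELS) * TRIALS_PER_CELL
print(len(conditions), total_trials)  # 7 630
print(differs_only_in_descriptor(conditions))  # True
```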
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that three LLMs (Gemini 3.5 Flash, Claude Sonnet 4.6, GPT-5.4-mini) produce markedly different triage recommendations for an identical neurological symptom profile when only patient gender and age are varied in the prompt. With n=30 trials per cell across seven conditions, young women receive substantially lower ER referral rates than age-matched men (Gemini 0% vs. 23.3%; Claude 6.7% vs. 96.7%; GPT 6.7% vs. 66.7%; all p<0.001), with the disparity vanishing at age 65. The primary mechanism is reported as diagnostic substitution: models preferentially assign Idiopathic Intracranial Hypertension to young women and space-occupying lesions to men, routing women to lower-urgency outpatient care despite comparable severity ratings (7-9/10). All prompts, code, and raw outputs are released.
Significance. If the measured disparities and mechanism hold under verification of the released materials, the result is significant because it provides direct, controlled evidence that LLMs can embed and amplify documented human clinical biases via epidemiological priors in triage decisions. The explicit release of all prompts, code, and raw results is a clear strength, permitting independent confirmation of prompt identity, diagnosis frequency counts, and output parsing rules.
minor comments (2)
- [Abstract] Abstract and Methods: The exact model versions or API identifiers (Gemini 3.5 Flash, Claude Sonnet 4.6, GPT-5.4-mini) should be stated with full precision, including any temperature or sampling parameters used.
- [Results] Results: While raw outputs are released, the manuscript should include a brief table or supplementary note summarizing the exact frequency of IIH versus space-occupying lesion mentions per condition to make the diagnostic-substitution claim immediately verifiable from the text.
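The frequency table requested in the second comment could be produced from the raw outputs along the following lines. This is a minimal sketch: the regular expressions, the condition labels, and the example responses are assumptions made up for illustration, and a real tally should follow whatever parsing rules the released code uses.

```python
import re
from collections import Counter

# Hypothetical patterns; not the authors' parsing rules.
PATTERNS = {
    "IIH": re.compile(r"idiopathic intracranial hypertension|\bIIH\b", re.I),
    "SOL": re.compile(r"space[- ]occupying lesion", re.I),
}

def mention_table(outputs):
    """Count, per condition, how many outputs mention each diagnosis at least once."""
    table = {}
    for condition, texts in outputs.items():
        counts = Counter()
        for text in texts:
            for label, pattern in PATTERNS.items():
                if pattern.search(text):
                    counts[label] += 1
        table[condition] = {**{label: counts[label] for label in PATTERNS}, "n": len(texts)}
    return table

# Synthetic responses for illustration only.
outputs = {
    "female-25": [
        "Most likely idiopathic intracranial hypertension (IIH); book an outpatient visit.",
        "Suspect IIH. See your doctor this week.",
    ],
    "male-25": [
        "Raised intracranial pressure; a space-occupying lesion must be excluded. Go to the ER.",
        "Possible IIH, but a space occupying lesion is in the differential. Go to the ER.",
    ],
}

table = mention_table(outputs)
for condition, row in table.items():
    print(condition, row)
```

Counting outputs that mention a diagnosis, not total mentions, keeps a single verbose response from inflating a cell. It also lets one response count toward both labels when the model lists both in its differential.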
Simulated Author's Rebuttal
We thank the referee for their accurate and positive summary of the manuscript, for recognizing the significance of the controlled evidence on gender-dependent triage disparities, and for highlighting the value of our full data release. The recommendation for minor revision is noted; however, the report contains no specific major comments requiring response.
Circularity Check
Empirical measurement with no circular derivation
full rationale
The paper reports direct experimental results from controlled prompt variations on three LLMs (n=30 per cell). Central claims rest on measured output frequencies (ER referral rates, diagnosis counts) under identical symptom text with only gender/age changed. No equations, fitted parameters, ansatzes, or derivations appear. All prompts, code, and raw outputs are released, permitting external verification of prompt identity and counts. No self-citation is load-bearing for the reported disparities or mechanism. The result is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard statistical significance testing (p < 0.001) is used to establish the disparities.
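The page does not say which significance test the paper used. As one plausible check, the sketch below runs a two-sided Fisher exact test on 2x2 tables reconstructed from the reported percentages. The counts are an assumption: they take n = 30 per group, so 23.3% becomes 7/30, 6.7% becomes 2/30, and so on. If the authors pooled the 25- and 38-year-old cells (n = 60 per group), the same percentages would correspond to different counts and smaller p-values.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in row 1 with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    tail = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, tail)

N = 30  # assumed per-group sample size (one demographic cell per gender)
# ER referral counts (young women, young men), reconstructed from percentages.
ER_COUNTS = {"Gemini": (0, 7), "Claude": (2, 29), "GPT": (2, 20)}

p_values = {}
for model, (women, men) in ER_COUNTS.items():
    p_values[model] = fisher_exact_two_sided(women, N - women, men, N - men)
    print(f"{model}: {women}/{N} vs {men}/{N}, p = {p_values[model]:.2g}")
```

Under this per-cell reading the Claude and GPT tables fall well below 0.001, while the Gemini table (0/30 versus 7/30) gives a two-sided p of roughly 0.011. Whether "all p < 0.001" holds therefore appears to depend on which cells were compared or pooled and on the test used, which the released raw outputs should make possible to verify.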