Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

Qi Han Wong

arxiv: 2606.03641 · v1 · pith:KHLV6KXLnew · submitted 2026-06-02 · 💻 cs.AI · cs.CY

Pith reviewed 2026-06-28 09:47 UTC · model grok-4.3

keywords: LLM medical triage, gender bias, diagnostic substitution, emergency referral, Idiopathic Intracranial Hypertension, AI healthcare bias, neurological symptoms

The pith

LLMs assign lower emergency urgency to identical neurological symptoms when the patient is described as a young woman.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models change their medical triage recommendations when only the stated gender and age of the patient differ while symptoms stay fixed. It runs the same profile of persistent headache, blurred vision, morning nausea, and visual disturbances through three model families across seven demographic conditions. Young women receive markedly lower rates of emergency-room referral than age-matched men, with gaps as large as 90 percentage points; the gap closes completely at age 65. The models reach these different urgencies by anchoring on different diagnoses: Idiopathic Intracranial Hypertension for women of reproductive age versus space-occupying lesions for men. The result shows that probabilistic epidemiological priors embedded in the models suppress triage urgency for one demographic group even when symptom severity ratings remain comparable.

Core claim

When the identical symptom profile is presented, the models preferentially classify young women with Idiopathic Intracranial Hypertension and route them to outpatient appointments, while classifying men with increased intracranial pressure and space-occupying lesions in the differential and referring them to the ER at much higher rates. ER referral percentages are Gemini 0 percent versus 23.3 percent, Claude 6.7 percent versus 96.7 percent, and GPT 6.7 percent versus 66.7 percent for young women versus young men, all statistically significant, with the disparity vanishing at age 65.
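
The percentages above can be turned back into counts and gaps with a few lines of arithmetic. The sketch below is illustrative only: it assumes each reported rate is a count out of 30 trials (the per-cell n from the abstract) rounded to one decimal, and it attaches an approximate 95% Wilson score interval to each proportion. The names `REPORTED`, `wilson_interval`, and `summarize` are made up for this example and do not come from the paper's released code.

```python
from math import sqrt

# ER referral rates (percent) as reported: (young women, young men).
REPORTED = {
    "Gemini": (0.0, 23.3),
    "Claude": (6.7, 96.7),
    "GPT": (6.7, 66.7),
}
N_PER_CELL = 30  # trials per condition per model, from the abstract


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def summarize(reported=REPORTED, n=N_PER_CELL):
    """Gap in percentage points plus an interval for each group."""
    rows = {}
    for model, (women_pct, men_pct) in reported.items():
        # Assumption: each percentage is a count out of n, rounded.
        women_k = round(women_pct / 100 * n)
        men_k = round(men_pct / 100 * n)
        rows[model] = {
            "gap_points": round(men_pct - women_pct, 1),
            "women_ci": wilson_interval(women_k, n),
            "men_ci": wilson_interval(men_k, n),
        }
    return rows


if __name__ == "__main__":
    for model, row in summarize().items():
        print(model, row)
```

With only 30 trials per cell the intervals are wide, which is one reason to read the per-model rates alongside the raw counts in the released outputs.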

What carries the argument

Diagnostic substitution: the process by which the model anchors on a gender-associated diagnosis (Idiopathic Intracranial Hypertension for women of childbearing age) rather than a neutral differential (increased intracranial pressure with space-occupying lesion), thereby assigning lower urgency despite equivalent severity scores.

If this is right

Clinical LLMs replicate documented human biases by routing patients to different care levels according to demographic priors.
Urgency assessment in AI triage must be decoupled from probabilistic diagnostic anchors to avoid systematic under-triage.
The effect is age-specific and disappears at 65, indicating the bias tracks reproductive-age stereotypes.
Releasing all prompts, code, and raw outputs enables direct replication and targeted mitigation testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same substitution pattern may appear with other symptom sets that carry strong demographic associations, such as atypical chest pain or abdominal pain.
Forcing models to output urgency ratings before any diagnosis could test whether the bias originates in the diagnostic step itself.
Real-world deployment would require ongoing stratified monitoring of referral rates by age and gender to detect similar gaps.
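
The stratified monitoring suggested in the last point could look roughly like the sketch below. It is a hypothetical illustration, not part of the paper: the record keys (`age_band`, `gender`, `er_referral`), the function names, and the 10-point alert threshold are all assumptions chosen for the example.

```python
from collections import defaultdict


def referral_rates(records):
    """Return the ER referral rate per (age_band, gender) stratum.

    `records` is an iterable of dicts with the hypothetical keys
    'age_band', 'gender', and 'er_referral' (bool).
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [er_count, total]
    for rec in records:
        key = (rec["age_band"], rec["gender"])
        counts[key][0] += int(rec["er_referral"])
        counts[key][1] += 1
    return {key: er / total for key, (er, total) in counts.items()}


def flag_gender_gaps(rates, threshold=0.10):
    """List age bands where male and female rates differ by more than threshold.

    The threshold is an arbitrary illustrative value, not a clinical standard.
    """
    flagged = []
    for age in sorted({age for age, _ in rates}):
        female = rates.get((age, "female"))
        male = rates.get((age, "male"))
        if female is None or male is None:
            continue  # stratum missing for one gender; nothing to compare
        gap = male - female
        if abs(gap) > threshold:
            flagged.append((age, gap))
    return flagged
```

In practice the flagged gaps would also need a sample-size check before anyone acts on them, since small strata produce noisy rates.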

Load-bearing premise

The measured difference in referral rates arises specifically from the models selecting different diagnoses on the basis of gender and age rather than from uncontrolled differences in prompt wording or output parsing.

What would settle it

Re-running the identical symptom prompts while explicitly instructing the models to ignore gender-linked epidemiological priors and produce the same diagnostic label for both genders would eliminate the urgency gap if diagnostic substitution is the operative mechanism.

Original abstract

We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families--Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini--we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%, all p < 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender-associated diagnosis, preferentially classifying young women with Idiopathic Intracranial Hypertension (IIH)--a condition epidemiologically linked to women of childbearing age--while diagnosing men with generic increased intracranial pressure with space-occupying lesions in the differential. This diagnostic closure routes female patients to lower-urgency care (outpatient doctor appointments) despite comparable severity ratings (7-9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. We release all code, prompts, and raw results.
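
The trial count in the abstract follows directly from the design grid. The sketch below enumerates it under stated assumptions: the abstract does not say whether the gender-unspecified baseline states an age, so the baseline is represented with `None` for both fields, and the dictionary layout and function names are hypothetical rather than taken from the released code.

```python
from itertools import product

AGES = (25, 38, 65)
GENDERS = ("male", "female")
MODELS = ("Gemini 3.5 Flash", "Claude Sonnet 4.6", "GPT-5.4-mini")
TRIALS_PER_CELL = 30


def build_conditions():
    """Six age-by-gender cells plus one gender-unspecified baseline."""
    conditions = [{"age": a, "gender": g} for a, g in product(AGES, GENDERS)]
    # Assumption: how the baseline handles age is not specified in the
    # abstract, so both fields are left as None here.
    conditions.append({"age": None, "gender": None})
    return conditions


def build_trials():
    """One record per (model, condition, repetition)."""
    return [
        {"model": model, "trial": trial, **cond}
        for model in MODELS
        for cond in build_conditions()
        for trial in range(TRIALS_PER_CELL)
    ]


if __name__ == "__main__":
    print(len(build_conditions()), "conditions")
    print(len(build_trials()), "trials")
```

Seven conditions times 30 repetitions times three models gives the 630 trials stated above.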

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper documents a large, replicable gender split in LLM ER referral rates for identical symptoms, driven by diagnostic substitution to IIH in young women.

This paper finds that three frontier LLMs assign markedly lower ER referral rates to young women than to age-matched men for the same headache, vision, and nausea symptoms, with the gap closing at age 65. The reported mechanism is that models default to Idiopathic Intracranial Hypertension for women and space-occupying lesions for men.

What stands out is the controlled setup across Gemini, Claude, and GPT, the consistent age-by-gender interaction, and the explicit link to diagnostic substitution rather than generic bias. The authors ran 30 trials per cell, report p<0.001 differences, note comparable severity scores, and release prompts, code, and raw outputs. That combination makes the central claim checkable and adds a concrete data point on how epidemiological priors affect triage urgency.

The main soft spot is that the abstract leaves the exact prompt wording and output coding rules implicit. Even with the release, a reader needs to verify that no subtle phrasing or parsing choice amplified the gap. The claim that severity ratings stayed in the 7-9 range is stated but would be stronger with more detail on elicitation. These are fixable rather than load-bearing.

The work is aimed at groups building or auditing clinical LLMs. Anyone tracking deployment safety or bias mitigation will get direct numbers and a testable mechanism from it. It is worth a serious referee because the design is simple, the materials are open, and the result bears on real routing decisions even if the interpretation needs tightening.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that three LLMs (Gemini 3.5 Flash, Claude Sonnet 4.6, GPT-5.4-mini) produce markedly different triage recommendations for an identical neurological symptom profile when only patient gender and age are varied in the prompt. With n=30 trials per cell across seven conditions, young women receive substantially lower ER referral rates than age-matched men (Gemini 0% vs. 23.3%; Claude 6.7% vs. 96.7%; GPT 6.7% vs. 66.7%; all p<0.001), with the disparity vanishing at age 65. The primary mechanism is reported as diagnostic substitution: models preferentially assign Idiopathic Intracranial Hypertension to young women and space-occupying lesions to men, routing women to lower-urgency outpatient care despite comparable severity ratings (7-9/10). All prompts, code, and raw outputs are released.

Significance. If the measured disparities and mechanism hold under verification of the released materials, the result is significant because it provides direct, controlled evidence that LLMs can embed and amplify documented human clinical biases via epidemiological priors in triage decisions. The explicit release of all prompts, code, and raw results is a clear strength, permitting independent confirmation of prompt identity, diagnosis frequency counts, and output parsing rules.

minor comments (2)

[Abstract and Methods] The exact model versions or API identifiers (Gemini 3.5 Flash, Claude Sonnet 4.6, GPT-5.4-mini) should be stated with full precision, including any temperature or sampling parameters used.
[Results] While raw outputs are released, the manuscript should include a brief table or supplementary note summarizing the exact frequency of IIH versus space-occupying lesion mentions per condition to make the diagnostic-substitution claim immediately verifiable from the text.
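
A per-condition frequency table of the kind requested in the second comment could be produced from the raw outputs along these lines. This is a minimal sketch under assumptions: the keyword patterns are hypothetical stand-ins for whatever coding rules the paper uses, and the input is assumed to be simple `(condition, response_text)` pairs, which may not match the format of the released files.

```python
import re
from collections import Counter

# Hypothetical keyword patterns; the paper's own coding rules are not given here.
PATTERNS = {
    "IIH": re.compile(
        r"idiopathic intracranial hypertension|\bIIH\b|pseudotumor cerebri",
        re.IGNORECASE,
    ),
    "space_occupying_lesion": re.compile(
        r"space[- ]occupying lesion|brain tumou?r|intracranial mass",
        re.IGNORECASE,
    ),
}


def count_mentions(outputs):
    """Count, per condition, how many responses mention each diagnosis.

    `outputs` is an iterable of (condition, response_text) pairs.
    Each response is counted at most once per diagnosis label.
    """
    table = {}
    for condition, text in outputs:
        row = table.setdefault(condition, Counter())
        row["n"] += 1
        for label, pattern in PATTERNS.items():
            if pattern.search(text):
                row[label] += 1
    return table
```

Keyword matching of this kind counts mentions, not primary diagnoses, so a response that lists IIH only to rule it out would still be counted; a stricter coding rule would be needed to separate the two.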

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate and positive summary of the manuscript, for recognizing the significance of the controlled evidence on gender-dependent triage disparities, and for highlighting the value of our full data release. The recommendation for minor revision is noted; however, the report contains no specific major comments requiring response.

Circularity Check

0 steps flagged

Empirical measurement with no circular derivation

The paper reports direct experimental results from controlled prompt variations on three LLMs (n=30 per cell). Central claims rest on measured output frequencies (ER referral rates, diagnosis counts) under identical symptom text with only gender/age changed. No equations, fitted parameters, ansatzes, or derivations appear. All prompts, code, and raw outputs are released, permitting external verification of prompt identity and counts. No self-citation is load-bearing for the reported disparities or mechanism. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical evaluation relying on standard LLM prompting and statistical analysis; no free parameters or new entities are introduced beyond the experimental design.

axioms (1)

standard math: Standard statistical significance testing (p < 0.001) is used to determine disparities.
The paper reports p-values for the differences.
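
For readers who want to re-check the significance claims against the released counts, one option is a two-sided Fisher exact test on each 2x2 table of ER versus non-ER referrals. The sketch below is an assumption-laden illustration: this page does not say which test the paper uses or whether age groups were pooled, and the counts are inferred from the reported percentages with n = 30 per cell. P-values from this sketch therefore need not match the reported ones.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Assumption: ER counts out of 30, inferred from the reported percentages
# as (young women, young men).
INFERRED_ER_COUNTS = {"Gemini": (0, 7), "Claude": (2, 29), "GPT": (2, 20)}
N_PER_CELL = 30


def p_values(counts=INFERRED_ER_COUNTS, n=N_PER_CELL):
    return {
        model: fisher_exact_two_sided(women, n - women, men, n - men)
        for model, (women, men) in counts.items()
    }


if __name__ == "__main__":
    for model, p in p_values().items():
        print(f"{model}: p = {p:.3g}")
```

Running the same function on the actual counts from the released raw results, with whatever pooling the paper applies, would be the proper check.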

pith-pipeline@v0.9.1-grok · 5843 in / 1273 out tokens · 27624 ms · 2026-06-28T09:47:21.428036+00:00 · methodology

Reference graph

Works this paper leans on

11 extracted references · 1 linked inside Pith

[1] Bugiardini, R. and Bairey Merz, C. N. Angina with "normal" coronary arteries: A changing philosophy. JAMA, 293(4):477-484, 2005

[2] Omar, M., Sorin, V., Agbareia, R., Apakama, D. U., Soroush, A., Sakhuja, A., Freeman, R., Horowitz, C. R., Richardson, L. D., Nadkarni, G. N., and Klang, E. Evaluating and addressing demographic disparities in medical large language models: A systematic review. International Journal for Equity in Health, 24:57, 2025

[3] Friedman, D. I., Liu, G. T., and Digre, K. B. Revised diagnostic criteria for the pseudotumor cerebri syndrome in adults and children. Neurology, 81(13):1159-1165, 2013

[4] Mollan, S. P., Davies, B., Silver, N. C., Shaw, S., Mallucci, C. L., Sheridan, G. I., Lister, A., Sheridan, E., Sheridan, P., and Sinclair, A. J. Idiopathic intracranial hypertension: Consensus guidelines on management. Journal of Neurology, Neurosurgery & Psychiatry, 89(10):1088-1100, 2018

[5] Nori, H., King, N., McKinney, S. M., Carignan, D., and Horvitz, E. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023

[6] Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V., and Daneshjou, R. Large language models propagate race-based medicine. NPJ Digital Medicine, 6(1):195, 2023

[7] Pfohl, S. R., Cole-Lewis, H., Sayres, R., Neal, D., Asber, M., Celi, L. A., Callahan, A., Seneviratne, M., Hanna, M., and Singhal, K. A toolbox for surfacing health equity harms and biases in large language models. Nature Medicine, 30:3590-3600, 2024

[8] Samulowitz, A., Gremyr, I., Eriksson, E., and Hensing, G. "Brave men" and "emotional women": A theory-guided systematic review of gender biases in health care. Pain Research and Management, 2018:6358624, 2018

[9] Singhal, K., Azizi, S., Tu, T., et al. Large language models encode clinical knowledge. Nature, 620(7972):172-180, 2023

[10] Yu, A. Y. X., Penn, A. M., Bhatt, D. L., et al. Sex differences in presentation and outcome after an acute transient or minor neurologic event. JAMA Neurology, 76(8):962-968, 2019

[11] Zack, T., Lehman, E., Suzgun, M., Rodriguez, J. A., et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care. The Lancet Digital Health, 6(1):e12-e22, 2024
