pith. machine review for the scientific record.

arxiv: 2604.25415 · v1 · submitted 2026-04-28 · 🧬 q-bio.NC · cs.AI · cs.HC

Recognition: unknown

One-shot emergency psychiatric triage across 15 frontier AI chatbots

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 14:00 UTC · model grok-4.3

classification 🧬 q-bio.NC · cs.AI · cs.HC
keywords psychiatric triage · AI chatbots · emergency detection · over-triage · clinical vignettes · mental health · triage accuracy · large language models

The pith

Frontier AI chatbots recognize psychiatric emergencies from single messages with near-zero error rates but over-triage low- and intermediate-risk cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how well 15 leading AI chatbots perform psychiatric triage when given a single realistic message from a user. It uses 112 vignettes covering different psychiatric issues and risk levels, each labeled with one of four urgency categories: routine care, assessment within a week, assessment within 24-48 hours, or emergency now. The chatbots correctly flagged nearly all true emergencies but recommended quicker action than necessary for many milder cases, producing net over-triage. This matters because people increasingly turn to AI for mental health advice, and poor triage could either miss crises or strain medical resources with unnecessary urgent referrals. The results held when checked against judgments from 50 doctors.

Core claim

When presented with user messages containing sufficient clinical information, frontier AI chatbots recognized psychiatric emergencies as requiring urgent medical assessment with near-zero error rates, yet showed marked over-triage for low- and intermediate-risk presentations. Accuracy was highest for emergency vignettes at 94.3 percent and lowest for vignettes requiring assessment within a week at 19.7 percent, with a mean signed ordinal error of +0.47 triage levels indicating net over-triage.
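
Both headline statistics are simple functions of the paired (model, reference) triage levels. A minimal sketch of how they can be computed, assuming levels A-D are encoded as integers 1-4; the encoding and function names are illustrative and not taken from the paper:

```python
# Sketch of the two headline metrics, assuming triage levels A-D map to 1-4.
# The paper does not publish its analysis code; names here are illustrative.
from collections import defaultdict

LEVELS = {"A": 1, "B": 2, "C": 3, "D": 4}

def per_level_accuracy(pred, truth):
    """Fraction of correct assignments, grouped by reference triage level."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, t in zip(pred, truth):
        totals[t] += 1
        hits[t] += int(p == t)
    return {lvl: hits[lvl] / totals[lvl] for lvl in totals}

def mean_signed_ordinal_error(pred, truth):
    """Mean of (predicted - reference) on the 1-4 scale; positive = net over-triage."""
    diffs = [LEVELS[p] - LEVELS[t] for p, t in zip(pred, truth)]
    return sum(diffs) / len(diffs)

# Toy example: one B-level vignette triaged up to C contributes +1,
# so the mean signed error over these four trials is +0.25.
pred  = ["D", "C", "C", "A"]
truth = ["D", "B", "C", "A"]
print(per_level_accuracy(pred, truth))         # {'D': 1.0, 'B': 0.0, 'C': 1.0, 'A': 1.0}
print(mean_signed_ordinal_error(pred, truth))  # 0.25
```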

What carries the argument

112 author-created clinical vignettes, each rendered as a realistic conversational query and labeled with one of four triage levels from A (routine) to D (emergency now), used to test one-shot triage assignment by the AI models.
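
The abstract specifies how those 112 vignettes decompose: 28 presentation-by-risk groups, each contributing exactly one vignette at each of the four triage levels. A minimal sketch of that design as a data structure; the field names are assumptions for illustration, since the paper's data format is not reproduced here:

```python
# Sketch of the benchmark's balanced design: 28 groups x 4 levels = 112 vignettes.
# Field names are illustrative, not the authors' schema.
from dataclasses import dataclass

@dataclass
class Vignette:
    group_id: int        # 1..28 presentation-by-risk group
    cluster: str         # one of 9 psychiatric presentation clusters
    risk_dimension: str  # one of 9 focal risk dimensions
    triage_level: str    # "A" routine, "B" within 1 week, "C" within 24-48 h, "D" emergency now
    message: str         # realistic single-message user disclosure

def check_design(vignettes: list[Vignette]) -> None:
    """Each of the 28 groups must contribute exactly one vignette per triage level."""
    assert len(vignettes) == 112
    for gid in {v.group_id for v in vignettes}:
        levels = sorted(v.triage_level for v in vignettes if v.group_id == gid)
        assert levels == ["A", "B", "C", "D"]
```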

Load-bearing premise

The 112 author-created vignettes accurately represent the range, subtlety, and ambiguity of real patient disclosures, and the assigned triage labels constitute reliable ground truth without substantial unmeasured inter-clinician disagreement.

What would settle it

A study in which actual patients provide real disclosures and AI triage outputs on those same messages are compared with independent clinician assessments.

Figures

Figures reproduced from arXiv: 2604.25415 by Christopher Summerfield, Elise Wilkinson, Lennart Luettgau, Matthew M Nour, Veith Weilnhammer, Viknesh Sounderajah, Virginia Corno.

Figure 1. Schematic overview of the psychiatric triage benchmark. Clinical vignettes were organized according to clinical presentation clusters, focal risk dimensions, and 4 pre-specified triage levels (A). Each vignette was compressed into a comprehensive case summary by the vignette-generator LLM (Claude Opus 4.6) (B) that was used both for clinician consensus labeling (C) and as input to a user-model LLM, the lat…

Figure 2. Psychiatric triage performance and error profiles by target AI chatbot relative to the original…
read the original abstract

AI chatbots are increasingly used for health advice, but their performance in psychiatric triage remains undercharacterized. Psychiatric triage is particularly challenging because urgency must often be inferred from thoughts, behavior, and context rather than from objective findings. We evaluated the performance of 15 frontier AI chatbots on psychiatric triage from realistic single-message disclosures using 112 clinical vignettes, each paired with 1 of 4 original benchmark triage labels: A, routine; B, assessment within 1 week; C, assessment within 24 to 48 hours; and D, emergency care now. Vignettes covered 9 psychiatric presentation clusters and 9 focal risk dimensions, organized into 28 presentation-by-risk groups. Each group contributed 4 distinct vignettes, with 1 vignette at each triage level. Each vignette was rendered as a realistic human-authored conversational query, and the AI chatbots were tasked with assigning a triage label from that disclosure. Emergency under-triage occurred in 23 of 410 level D trials (5.6%), and all under-triaged emergencies were reassigned to level C urgency. Across target models, average accuracy ranged from 42.0% to 71.8%. Accuracy was highest for level D vignettes (94.3%) and lowest for level B vignettes (19.7%). Mean signed ordinal error was positive (+0.47 triage levels), indicating net over-triage. Dispersion was highest around the middle triage levels. All results were confirmed relative to clinician consensus labels from 50 medical doctors. When presented with user messages containing sufficient clinical information, frontier AI chatbots thus recognized psychiatric emergencies as requiring urgent medical assessment with near-zero error rates, yet showed marked over-triage for low and intermediate risk presentations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 4 minor

Summary. The manuscript evaluates 15 frontier AI chatbots on one-shot psychiatric triage using 112 author-created vignettes spanning 9 presentation clusters and 9 risk dimensions, organized into 28 groups with exactly one vignette per triage level (A: routine; B: assessment within 1 week; C: within 24-48 hours; D: emergency now). Each vignette is rendered as a realistic single user message. The study reports average model accuracies of 42.0-71.8%, with 94.3% accuracy on D-level emergencies (5.6% under-triage, all reassigned to C), only 19.7% accuracy on B-level cases, net over-triage (+0.47 ordinal levels), and highest dispersion around intermediate levels, all benchmarked against consensus labels from 50 clinicians.

Significance. If the results hold under more varied conditions, the work establishes a useful initial benchmark showing that current frontier models can reliably detect clear psychiatric emergencies from sufficiently detailed single-message disclosures while systematically over-triaging lower-risk presentations. This has direct implications for safe deployment of AI in mental health contexts, including the design of guardrails to mitigate over-referral burden and the need for human oversight on ambiguous cases.

major comments (3)
  1. [Methods (vignette construction)] Methods (vignette construction and design): The 112 vignettes are engineered within each of the 28 presentation-by-risk groups to supply the precise clinical details required for their assigned triage level. This controlled construction, while enabling clean comparison, does not test performance on the incomplete, mixed-signal, or ambiguous disclosures common in real patient messages, which directly limits the generalizability of the reported 94.3% D-level accuracy and the claim of near-zero error rates for emergencies under realistic conditions.
  2. [Results (validation)] Results (validation and metrics): No inter-rater reliability statistics (e.g., Fleiss' kappa or agreement percentages) are reported for the 50-clinician consensus labels used as ground truth. This is particularly relevant for the intermediate B and C levels, where model accuracy is lowest (19.7% for B) and where clinician disagreement on urgency is likely higher, raising the possibility that some reported model errors reflect label uncertainty rather than model shortcomings.
  3. [Results (accuracy metrics)] Results (statistical analysis): The manuscript reports mean signed ordinal error (+0.47) and accuracy ranges but does not include statistical significance tests, confidence intervals, or per-model breakdowns for differences across triage levels or models. These are needed to substantiate claims of 'marked over-triage' and to assess whether the observed patterns are robust.
minor comments (4)
  1. [Methods (AI querying)] The exact prompt templates and system instructions used to query each of the 15 chatbots are not provided, limiting reproducibility of the one-shot triage task.
  2. [Abstract] The abstract states that 'all results were confirmed relative to clinician consensus labels' without detailing the consensus procedure or any residual disagreements, which should be expanded for transparency.
  3. [Results (tables/figures)] Tables or figures summarizing per-model performance would benefit from explicit reporting of total trials (112 vignettes × 15 models = 1680) and any missing responses or refusals.
  4. [Abstract and Introduction] Minor typographical inconsistencies appear in the description of triage level definitions between the abstract and main text; ensure uniform wording.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major point below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: Methods (vignette construction and design): The 112 vignettes are engineered within each of the 28 presentation-by-risk groups to supply the precise clinical details required for their assigned triage level. This controlled construction, while enabling clean comparison, does not test performance on the incomplete, mixed-signal, or ambiguous disclosures common in real patient messages, which directly limits the generalizability of the reported 94.3% D-level accuracy and the claim of near-zero error rates for emergencies under realistic conditions.

    Authors: Our study was designed as a controlled benchmark specifically to assess one-shot triage when user messages contain sufficient clinical information, as stated in the abstract and methods. The vignettes were authored as realistic conversational queries that systematically include the details needed to assign each triage level within the 28 groups. This enables direct comparison of model performance under standardized conditions. We agree that this does not fully capture real-world messages that may be incomplete or ambiguous, which limits extrapolation to all clinical scenarios. We will expand the Limitations section to explicitly discuss this constraint and recommend future work using naturalistic, unengineered patient disclosures. revision: partial

  2. Referee: Results (validation and metrics): No inter-rater reliability statistics (e.g., Fleiss' kappa or agreement percentages) are reported for the 50-clinician consensus labels used as ground truth. This is particularly relevant for the intermediate B and C levels, where model accuracy is lowest (19.7% for B) and where clinician disagreement on urgency is likely higher, raising the possibility that some reported model errors reflect label uncertainty rather than model shortcomings.

    Authors: We agree that reporting inter-rater reliability would strengthen interpretation of the ground truth, particularly for intermediate levels where clinical judgment varies. The labels were obtained through a consensus process among 50 clinicians, but individual pre-consensus ratings were not retained and therefore cannot be used to compute statistics such as Fleiss' kappa. We will revise the Methods section to describe the consensus procedure more fully and add an explicit limitation statement acknowledging the lack of inter-rater metrics and the potential contribution of label uncertainty to observed errors at B and C levels. revision: partial

  3. Referee: Results (statistical analysis): The manuscript reports mean signed ordinal error (+0.47) and accuracy ranges but does not include statistical significance tests, confidence intervals, or per-model breakdowns for differences across triage levels or models. These are needed to substantiate claims of 'marked over-triage' and to assess whether the observed patterns are robust.

    Authors: We will strengthen the statistical reporting in the revised manuscript. We will add 95% bootstrap confidence intervals for accuracy per triage level, overall accuracy, and the mean signed ordinal error. Per-model performance breakdowns will be included in supplementary tables. We will also add appropriate statistical tests (e.g., Friedman test with post-hoc comparisons) to evaluate differences across triage levels, with multiplicity corrections, to better substantiate the over-triage pattern. revision: yes
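
For context on the uncertainty estimates promised in response 3 above, a minimal sketch of a percentile bootstrap confidence interval over per-trial correctness flags; the resampling unit (trial rather than vignette or model), the function name, and the toy data are assumptions for illustration, not the authors' planned analysis:

```python
# Sketch of a percentile bootstrap CI for accuracy over 0/1 correctness flags.
# Resampling unit and names are illustrative assumptions, not the paper's method.
import random

def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI for the mean of 0/1 flags."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(
        sum(correct[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, (lo, hi)

# Toy example: 20 correct out of 100 trials, roughly the level-B accuracy regime.
flags = [1] * 20 + [0] * 80
point, (lo, hi) = bootstrap_accuracy_ci(flags)
print(point, lo, hi)  # ~0.20 with a bootstrap interval around it
```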

Circularity Check

0 steps flagged

No significant circularity; direct empirical evaluation against external clinician labels

full rationale

The paper performs a straightforward empirical test: 112 author-generated vignettes are labeled via consensus from 50 independent clinicians, then 15 AI models are prompted to assign triage levels and their outputs are compared to those labels. No equations, model fitting, derivations, or self-citations underpin the central claims. Accuracy figures (e.g., 94.3% on level D) are observed outcomes relative to the external benchmark, not quantities forced by construction or renamed inputs. The work is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of author-generated vignettes and expert triage labels as proxies for real clinical encounters; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption The 112 vignettes sufficiently capture realistic one-shot patient disclosures across the tested psychiatric clusters and risk dimensions.
    Invoked to justify generalization from the benchmark to real-world use.
  • domain assumption The four-level triage labels assigned to vignettes represent clinically appropriate ground truth.
    Basis for computing accuracy and signed error against model outputs.

pith-pipeline@v0.9.0 · 5646 in / 1410 out tokens · 117247 ms · 2026-05-07T14:00:25.287857+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 12 canonical work pages

  1. [1]

    Schematic overview of the psychiatric triage benchmark. Clinical vignettes were organized according to clinical presentation clusters, focal risk dimensions, and 4 pre-specified triage levels (A). Each vignette was compressed into a comprehensive case summary by the vignette-generator LLM (Claude Opus 4.6) (B) that was used both for clinician consensus la...

  2. [2]

    We interacted with each target AI chatbot through the OpenRouter API. All model calls were made between March 25, 2026 and March 29,

  3. [3]

    Introducing ChatGPT Health. (2026)

  4. [4]

    Busch, F. et al. Current applications and challenges in large language models for patient care: A systematic review. Communications Medicine 5, 26 (2025)

  5. [5]

    Hond, A. de et al. From text to treatment: The crucial role of validation for generative large language models in health care. The Lancet Digital Health 6, e441–e443 (2024)

  6. [6]

    Gaber, F. et al. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digital Medicine 8, 263 (2025)

  7. [7]

    Liu, Y. et al. A multi-agent framework combining large language models with medical flowcharts for self-triage. Nature Health 1–10 (2026) doi:10.1038/s44360-026-00112-2

  8. [8]

    Costa-Gomes, B. et al. Public use of a generalist LLM chatbot for health queries. Nature Health 1–8 (2026) doi:10.1038/s44360-026-00117-x

  9. [9]

    Chatterji, A. et al. How People Use ChatGPT. (2025) doi:10.3386/w34255

  10. [10]

    Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature Medicine 30, 2613–2622 (2024)

  11. [11]

    Ramaswamy, A. et al. ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine 1–5 (2026) doi:10.1038/s41591-026-04297-7

  12. [12]

    Nelson, B. W. et al. An AI-based mental health guardrail and dataset for identifying psychiatric crises in text-based conversations. npj Digital Medicine (2026) doi:10.1038/s41746-026-02579-5

  13. [13]

    Stigter-Outshoven, C. et al. Competencies Emergency and Mental Health Nurses Need in Triage in Acute Mental Health Care: A Narrative Review. Journal of Emergency Nursing 50, 55–71 (2024)

  14. [14]

Chen, S. et al. The effect of using a large language model to respond to patient messages. The Lancet Digital Health 6, e379–e381 (2024)

  15. [15]

    Vasey, B. et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nature Medicine 28, 924–933 (2022)

  16. [16]

Gallifant, J. et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nature Medicine 31, 60–69 (2025)

  17. [17]

    Liu, T. et al. A foundational triage system for improving accuracy in moderate acuity level emergency classifications. Communications Medicine 5, 322 (2025)

  18. [18]

    Opel, N. et al. Transforming mental health research and care through artificial intelligence. Science 391, 249–258 (2026)

  19. [19]

    Dohnány, S. et al. Technological folie à deux: Feedback Loops Between AI Chatbots and Mental Illness. (2025) doi:10.48550/arXiv.2507.19218

  20. [20]

Morrin, H. et al. Artificial intelligence-associated delusions and large language models: Risks, mechanisms of delusion co-creation, and safeguarding strategies. The Lancet Psychiatry S2215-0366(25)00396-7 (2026) doi:10.1016/S2215-0366(25)00396-7

  21. [21]

    Moore, J. et al. Characterizing Delusional Spirals through Human-LLM Chat Logs. (2026) doi:10.48550/arXiv.2603.16567

  22. [22]

    Yuan, S. et al. Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs. (2025) doi:10.48550/arXiv.2510.08158

  23. [23]

    Sax, D. R. et al. Association Between Emergency Department Undertriage or Overtriage With Timeliness of Care and Patient Outcomes. Annals of Emergency Medicine S0196-0644(25)01386-1 (2026) doi:10.1016/j.annemergmed.2025.11.018

  24. [24]

    Sax, D. R. et al. Emergency Department Triage Accuracy and Delays in Care for High-Risk Conditions. JAMA Network Open 8, e258498 (2025)

  25. [25]

Shekar, S. et al. People over-trust AI-generated medical responses and view them to be as valid as doctors, despite low accuracy. (2024) doi:10.48550/arXiv.2408.15266

  26. [26]

    Weilnhammer, V. et al. Vulnerability-Amplifying Interaction Loops: A systematic failure mode in AI chatbot mental-health interactions. (2026) doi:10.48550/arXiv.2602.01347