Do LLMs Triage Like Clinicians? A Dynamic Study of Outpatient Referral

Benyou Wang; Bingquan Zhang; Guangjun Yu; Jian Chang; Junying Chen; Qingying Xiao; Xiang Wan; Xiangyi Feng; Xiaoxiao Liu; Yan Hu

arxiv: 2503.08292 · v5 · pith:NXZ2ZF6Jnew · submitted 2025-03-11 · 💻 cs.CL · cs.AI

Xiaoxiao Liu, Qingying Xiao, Bingquan Zhang, Junying Chen, Xiangyi Feng, Ziniu Li, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang

Pith reviewed 2026-05-23 00:38 UTC · model grok-4.3

keywords: large language models · outpatient referral · dynamic triage · clinical decision-making · uncertainty reduction · multi-turn dialogue · medical artificial intelligence

The pith

Large language models match traditional classifiers in static outpatient referral accuracy but outperform them in dynamic multi-turn settings through better questioning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates outpatient referral as a dynamic process rather than a static classification task, testing LLMs against traditional classifiers in both fixed-information and interactive dialogue scenarios. In static cases with complete patient data, LLMs show only limited improvements in assigning patients to the correct hospital department. However, when allowed to engage in multi-turn conversations to gather more information, LLMs consistently perform better by posing follow-up questions that more effectively reduce uncertainty about the appropriate department. The authors conclude that LLMs' primary contribution is in enabling interactive, uncertainty-aware decision support rather than superior one-time predictions. Readers interested in medical AI applications would care because this distinguishes the contexts where LLMs provide meaningful advantages in clinical workflows.

Core claim

The study finds that LLMs offer limited advantages over traditional classifiers in static referral accuracy based on fixed patient information. In dynamic settings involving multi-turn dialogue, LLMs outperform classifiers by asking discriminative follow-up questions that reduce uncertainty over candidate departments. This indicates that the value of LLMs in outpatient referral lies in supporting interactive clinical decision-making rather than static prediction.

What carries the argument

Dynamic multi-turn dialogue scenarios that simulate information acquisition and uncertainty reduction over candidate departments
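The idea of a follow-up question "reducing uncertainty over candidate departments" can be made concrete with expected information gain. The sketch below is illustrative only and is not taken from the paper: the function names, the two-department prior, and the answer likelihoods are all hypothetical, and the paper may measure uncertainty differently.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a dict mapping department -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(prior, likelihood, answer):
    """Bayes update, where likelihood[dept][answer] = P(answer | dept)."""
    unnorm = {d: prior[d] * likelihood[d][answer] for d in prior}
    z = sum(unnorm.values())
    return {d: v / z for d, v in unnorm.items()}


def expected_information_gain(prior, likelihood):
    """Expected drop in entropy over departments after hearing the answer."""
    answers = next(iter(likelihood.values())).keys()
    h_prior = entropy(prior)
    gain = 0.0
    for a in answers:
        # Marginal probability of this answer under the current belief.
        p_a = sum(prior[d] * likelihood[d][a] for d in prior)
        if p_a == 0:
            continue
        gain += p_a * (h_prior - entropy(posterior(prior, likelihood, a)))
    return gain


# Hypothetical belief and answer models, invented for illustration.
prior = {"cardiology": 0.5, "gastroenterology": 0.5}
exertion_question = {  # answers differ sharply between departments
    "cardiology": {"yes": 0.9, "no": 0.1},
    "gastroenterology": {"yes": 0.1, "no": 0.9},
}
fatigue_question = {  # answers are equally likely in both departments
    "cardiology": {"yes": 0.5, "no": 0.5},
    "gastroenterology": {"yes": 0.5, "no": 0.5},
}

gain_exertion = expected_information_gain(prior, exertion_question)
gain_fatigue = expected_information_gain(prior, fatigue_question)
```

Under these made-up numbers, the exertion question has a positive expected gain while the fatigue question has none, which is one way to read "discriminative" in the claim above.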

If this is right

Outpatient referral systems benefit more from interactive questioning strategies than from improved static classifiers.
LLMs can be leveraged specifically for generating adaptive follow-up questions in clinical interactions.
Traditional machine learning models remain competitive for initial assessments but are less effective in evolving information contexts.
Clinical AI tools should prioritize uncertainty reduction mechanisms in their design for referral tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These findings may extend to other medical decision processes that involve sequential information gathering, such as diagnosis or treatment selection.
Validation against real clinician-patient conversation data would strengthen the applicability of the dynamic model.
Hybrid systems combining classifiers for quick triage and LLMs for dialogue could optimize referral efficiency.
Deployment in hospitals might lead to fewer misdirected referrals if questioning reduces initial uncertainty effectively.
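The hybrid idea in the list above could be sketched as a confidence-based router: accept the classifier's referral when it is confident, otherwise hand off to a dialogue step. This is a hypothetical illustration, not a design from the paper; the function name `route`, the action labels, and the 0.8 threshold are assumptions.

```python
def route(classifier_probs, threshold=0.8):
    """Pick an action from a dict mapping department -> predicted probability.

    Returns ("refer", dept) when the top prediction meets the (hypothetical)
    confidence threshold, else ("ask_follow_up", dept) to signal that a
    dialogue step should gather more information first.
    """
    top_dept = max(classifier_probs, key=classifier_probs.get)
    if classifier_probs[top_dept] >= threshold:
        return ("refer", top_dept)
    return ("ask_follow_up", top_dept)


# Made-up classifier outputs for two patients.
confident_case = {"dermatology": 0.92, "allergy": 0.08}
ambiguous_case = {"cardiology": 0.55, "gastroenterology": 0.45}

confident_action = route(confident_case)
ambiguous_action = route(ambiguous_case)
```

In practice the threshold would need tuning against the cost of a misdirected referral versus the cost of an extra question, which the source does not quantify.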

Load-bearing premise

The multi-turn dialogue scenarios used in the study accurately reflect the information-acquisition and uncertainty-reduction processes in actual outpatient referral workflows.

What would settle it

A study comparing LLM and classifier performance using real recorded multi-turn referral conversations from hospitals, where LLMs fail to show superior uncertainty reduction, would falsify the dynamic advantage claim.

Original abstract

Outpatient referral (OR) is a core clinical workflow that assigns patients to hospital departments under incomplete and evolving information, yet it is commonly simplified as a static classification problem despite being inherently interactive in practice. In this work, we study outpatient referral as a dynamic process driven by information acquisition and uncertainty reduction. We analyze both static scenarios based on fixed patient information and dynamic scenarios involving multi-turn dialogue, to test whether large language models (LLMs) improve referral outcomes through better prediction or more effective questioning. Our findings show that LLMs offer limited advantages over traditional classifiers in static referral accuracy, but consistently outperform them in dynamic settings by asking discriminative follow-up questions that reduce uncertainty over candidate departments. These results suggest that the primary value of LLMs in outpatient referral lies not in static prediction, but in supporting interactive, uncertainty-aware clinical decision-making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main finding is that LLMs add little over classifiers on fixed patient data for outpatient referral but pull ahead in simulated multi-turn dialogues by generating useful follow-up questions.

The core takeaway here is that LLMs show limited edge in static referral classification but gain ground once the task turns interactive. The authors set up both fixed-information cases and multi-turn dialogues where the model asks questions to narrow department options, and they report better uncertainty reduction and final accuracy in the latter. That distinction between prediction and information-gathering is the clearest new angle; most prior work on referral stays in the static bucket, so framing the problem as dynamic information acquisition is a reasonable shift. The abstract also keeps the claim modest: LLMs are not portrayed as superior predictors overall, just better at the questioning step in their setup. That restraint is useful.

On the execution side, the stress-test concern about the dialogues lands. The abstract gives no sign that the multi-turn scenarios were checked against real referral logs, clinician notes, or expert ratings of question utility. If the simulated patient answers or the distribution of uncertainty across departments do not track actual outpatient flows, the measured advantage in information gain becomes hard to interpret as clinical evidence. The paper would be stronger with even a small validation step, such as having clinicians rate the generated questions or comparing against logged referral conversations. Methods details are also thin in the abstract, so the statistical controls, dataset size, and exact baselines matter for judging whether the dynamic win holds up.

Readers working on medical decision support or interactive AI systems would get the most from this; it is not a broad methods paper. The work is coherent enough on its own terms to warrant referee time, mainly to check the simulation design and any additional controls in the full text. I would send it out rather than desk-reject.

Referee Report

2 major / 0 minor

Summary. The paper claims that LLMs show only limited gains over traditional classifiers on static outpatient referral accuracy but consistently outperform them in dynamic multi-turn dialogue settings by generating more discriminative follow-up questions that reduce uncertainty over candidate departments; the primary value of LLMs is therefore argued to lie in interactive, uncertainty-aware decision support rather than static prediction.

Significance. If the empirical comparison is robust, the result would usefully shift attention from static classification benchmarks toward interactive information-acquisition protocols in clinical NLP, with potential implications for designing LLM agents that support real referral workflows.

major comments (2)

[Abstract and dynamic-scenario construction] The headline finding that LLMs outperform classifiers specifically via better questioning in dynamic settings rests on the unvalidated assumption that the constructed multi-turn dialogues produce uncertainty reduction comparable to actual outpatient referral workflows; the manuscript supplies no grounding in real referral logs, clinician think-aloud data, or post-hoc expert review of question utility (abstract and §3–4).
[Abstract] The abstract states clear findings yet supplies no description of datasets, patient cohort size, department taxonomy, statistical tests, or controls for static vs. dynamic conditions, preventing assessment of whether the reported accuracy and information-gain differences are reliable (abstract).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our study on dynamic outpatient referral. We address each major comment below and indicate planned revisions.

Referee: [Abstract and dynamic-scenario construction] The headline finding that LLMs outperform classifiers specifically via better questioning in dynamic settings rests on the unvalidated assumption that the constructed multi-turn dialogues produce uncertainty reduction comparable to actual outpatient referral workflows; the manuscript supplies no grounding in real referral logs, clinician think-aloud data, or post-hoc expert review of question utility (abstract and §3–4).

Authors: We agree this is a valid concern. The dynamic scenarios rely on constructed multi-turn dialogues designed to isolate information-acquisition effects rather than real referral logs or clinician think-aloud protocols. This controlled setup enables direct comparison of questioning strategies but does not claim ecological equivalence to live workflows. We will add an explicit limitations subsection in the discussion acknowledging the absence of real-world grounding and outlining future validation steps with clinical data. The current results are presented as evidence of relative advantage in simulated interactive settings, not as direct proof of clinical deployment readiness.
revision: partial

Referee: [Abstract] The abstract states clear findings yet supplies no description of datasets, patient cohort size, department taxonomy, statistical tests, or controls for static vs. dynamic conditions, preventing assessment of whether the reported accuracy and information-gain differences are reliable (abstract).

Authors: We will revise the abstract to include concise details on the dataset source and size, the department taxonomy, the statistical tests employed, and the controls distinguishing static from dynamic conditions. These additions will be made without exceeding standard abstract length limits.
revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison without derivations or self-referential predictions

The paper is an empirical study that constructs static and dynamic referral scenarios, runs LLMs and traditional classifiers on them, and reports comparative accuracy and questioning behavior. No equations, parameter fits presented as predictions, uniqueness theorems, or self-citation chains appear in the provided abstract or described methodology. The central claim rests on experimental outcomes rather than any reduction of a result to its own inputs by construction. This matches the reader's assessment that the work contains no derivations or fitted-input predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is purely empirical; no mathematical derivations or new formal models are described in the abstract, so the ledger contains only the core domain assumption needed to interpret the dynamic results.

axioms (1)

Domain assumption: The simulated multi-turn dialogues reflect real clinical information needs and uncertainty reduction in outpatient referral.
This premise is required to interpret the performance difference between static and dynamic conditions as clinically meaningful.

pith-pipeline@v0.9.0 · 5701 in / 1148 out tokens · 55886 ms · 2026-05-23T00:38:06.281901+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PrinciplismQA: A Philosophy-Grounded Approach to Assessing LLM-Human Clinical Medical Ethics Alignment
cs.CL · 2025-08 · unverdicted · novelty 6.0

PrinciplismQA benchmark reveals significant gaps in LLMs' clinical ethical reasoning despite high knowledge accuracy.