Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries
Pith reviewed 2026-05-20 19:23 UTC · model grok-4.3
The pith
Few-shot prompted LLMs exceed BioBERT baselines on point-estimate macro-F1 for four-class triage of online patient inquiries, with overlapping confidence intervals, and still require human oversight.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We study this as a four-class actionable triage task -- self-care, schedule-visit, urgent-clinician-review, or emergency-referral -- and find that the strongest LLM (Claude Haiku 4.5, 12-shot) reaches macro-F1 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals. Few-shot prompting and two-model agreement help in label-dependent ways: self-care agreement is reliable, urgent-clinician-review is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.
What carries the argument
Four-class actionable triage task using few-shot prompting of LLMs on informal patient text, evaluated against silver-trained TF-IDF and BioBERT baselines with macro-F1 plus safety metrics of emergency-recall, under-triage rate, and severe under-triage rate.
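The metrics named above can be sketched in a few lines of Python. The page does not give the paper's exact definitions, so the following are assumptions made for illustration: the four labels are ordered by severity as listed, under-triage means the predicted label is less severe than the gold label, severe under-triage means it is at least two levels less severe, and both rates are taken over all evaluated examples.

```python
from typing import Sequence

# Labels ordered from least to most severe (ordering taken from the task description).
LABELS = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]
SEVERITY = {label: rank for rank, label in enumerate(LABELS)}


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    scores = []
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def emergency_recall(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Share of gold emergency-referral cases that were predicted as emergency-referral."""
    preds_for_emergencies = [p for g, p in zip(gold, pred) if g == "emergency-referral"]
    if not preds_for_emergencies:
        return float("nan")
    hits = sum(p == "emergency-referral" for p in preds_for_emergencies)
    return hits / len(preds_for_emergencies)


def under_triage_rate(gold: Sequence[str], pred: Sequence[str], min_gap: int = 1) -> float:
    """Share of examples predicted at least `min_gap` severity levels below gold.

    min_gap=1 is the assumed under-triage definition; min_gap=2 the assumed severe variant.
    """
    misses = sum(SEVERITY[g] - SEVERITY[p] >= min_gap for g, p in zip(gold, pred))
    return misses / len(gold)
```

Calling `under_triage_rate(gold, pred, min_gap=2)` gives the assumed severe under-triage rate; if the paper normalizes by class rather than by all examples, the denominator would need to change accordingly.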
If this is right
- LLMs under few-shot conditions can prioritize patient messages for clinician attention when labeled data is scarce.
- Two-model agreement raises reliability for self-care decisions but leaves urgent-clinician-review decisions less stable.
- Safety-aware metrics such as emergency-recall and under-triage rate must stay within acceptable bounds for any deployment.
- The current performance level supports selective human review rather than replacing clinicians entirely.
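One way to picture the agreement-based review policy implied by these points is the sketch below. It is not the authors' procedure: the function name `route`, the outcome strings, and the choice to trust agreement only on self-care are assumptions drawn from the claim that self-care agreement is reliable while urgent-clinician-review agreement is not.

```python
from typing import Optional, Tuple

# Assumption: only agreement on these labels is considered reliable enough
# to deprioritize a message; everything else goes to a human reviewer.
TRUSTED_AGREEMENT_LABELS = {"self-care"}


def route(label_a: str, label_b: str) -> Tuple[str, Optional[str]]:
    """Combine two models' labels into a routing decision.

    Returns ("low-priority", label) when both models agree on a trusted label,
    otherwise ("human-review", None).
    """
    if label_a == label_b and label_a in TRUSTED_AGREEMENT_LABELS:
        return ("low-priority", label_a)
    return ("human-review", None)
```

Under this policy a message is never closed automatically: the low-priority outcome only changes the order in which a clinician sees it, which matches the conclusion that the models support prioritization rather than autonomous use.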
Where Pith is reading between the lines
- Such triage models could be inserted as an initial filter in patient portals to reduce the number of messages a clinician sees first.
- Testing the same prompting setup on live streams of messages from multiple clinics would reveal how well the silver-label training generalizes.
- Adding per-prediction confidence thresholds might let the system route only the lowest-confidence cases to humans automatically.
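The confidence-threshold idea in the last point could look roughly like the following. The categorical levels high, medium, and low come from the prompt template quoted further down this page; the function name, the dictionary layout of a prediction, and the rule that a missing or unrecognized confidence goes to a human are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Ordering of the categorical confidence levels used in the prompt template.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def split_by_confidence(
    predictions: List[Dict], min_confidence: str = "medium"
) -> Tuple[List[Dict], List[Dict]]:
    """Split predictions into (to_human, to_queue) by self-reported confidence.

    Items below `min_confidence`, or with a missing or unknown confidence value,
    are sent to human review.
    """
    threshold = CONFIDENCE_RANK[min_confidence]
    to_human, to_queue = [], []
    for item in predictions:
        confidence = item.get("confidence")
        rank = CONFIDENCE_RANK.get(confidence, -1) if isinstance(confidence, str) else -1
        if rank >= threshold:
            to_queue.append(item)
        else:
            to_human.append(item)
    return to_human, to_queue
```

Self-reported confidence from a prompted model is not a calibrated probability, so any threshold chosen this way would itself need checking against labeled data.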
Load-bearing premise
The 300-example human-calibrated gold evaluation set and the auto-labeled silver training set from HealthCareMagic-100K are sufficiently representative and accurate to support reliable comparisons and safety conclusions for real-world patient inquiries.
What would settle it
A new collection of several hundred independently expert-labeled real-world patient inquiries where LLM triage outputs are compared directly to clinician decisions and actual clinical outcomes would show whether the macro-F1 edge and under-triage rates remain acceptable.
Original abstract
Online patient inquiries are often informal, incomplete, and written before professional assessment, yet they must still be routed to an appropriate level of clinical follow-up. We study this as a four-class actionable triage task -- self-care, schedule-visit, urgent-clinician-review, or emergency-referral -- and ask whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. We evaluate with macro-$F_1$ alongside safety-aware metrics, including emergency-recall, under-triage rate, and severe under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) reaches macro-$F_1$ 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals. Few-shot prompting and two-model agreement help in label-dependent ways: self-care agreement is reliable, urgent-clinician-review is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript empirically compares few-shot prompted LLMs (including Claude Haiku 4.5) against TF-IDF and BioBERT baselines for four-class actionable triage of online patient inquiries (self-care, schedule-visit, urgent-clinician-review, emergency-referral). Using HealthCareMagic-100K, the authors construct a 300-example human-calibrated gold evaluation set and a 700-example auto-labeled silver training set. Baselines are trained on silver labels; LLMs are evaluated in 0/4/12-shot regimes. Primary metrics are macro-F1 plus safety-aware quantities (emergency-recall, under-triage rate, severe under-triage rate). The strongest LLM reaches macro-F1 0.475 (vs. BioBERT 0.378) on point estimate with overlapping confidence intervals. The authors conclude that LLMs can support prioritization and selective human review but not autonomous deployment.
Significance. If the findings hold, the work supplies concrete evidence that few-shot LLMs can outperform supervised models trained on noisy silver labels in a safety-sensitive medical routing task. Explicit reporting of overlapping confidence intervals and inclusion of under-triage metrics are positive features that support a measured interpretation. The study contributes to low-resource healthcare NLP by showing practical limits of current prompting approaches. Credit is due for the safety-focused evaluation design and the explicit caveat against autonomous use.
major comments (2)
- [§3.2 and §4.1] The 300-example human-calibrated gold set is small for reliable estimation of tail-sensitive safety metrics such as under-triage rate and severe under-triage rate. No class distribution, inter-annotator agreement, or calibration protocol is reported, which directly affects the trustworthiness of the safety conclusions that underwrite the central claim.
- [§3.1] The auto-labeling procedure used to create the 700-example silver training set is described at a high level only. Without details on the labeling model, threshold choices, or error analysis, it is impossible to rule out systematic biases that could disadvantage the supervised baselines and thereby exaggerate the relative advantage of the prompted LLMs.
minor comments (3)
- [Abstract] Abstract and §4: The notation 'macro-$F_1$' should be expanded to 'macro-averaged F1' on first use for clarity.
- [Results] Table 2 (or equivalent results table): Confidence intervals are mentioned in text but not shown for all safety metrics; adding them would improve interpretability of the overlapping-CI statement.
- [§2] A brief reference to prior work on medical triage datasets or LLM safety evaluation would help situate the contribution.
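The request for confidence intervals on every safety metric could be met with a resampling procedure along these lines. This is a generic percentile bootstrap, offered as a sketch; the page does not say how the paper computed its intervals, and the helper names `bootstrap_ci` and `error_rate` are hypothetical.

```python
import random
from typing import Callable, Sequence, Tuple


def bootstrap_ci(
    gold: Sequence[str],
    pred: Sequence[str],
    metric: Callable[[Sequence[str], Sequence[str]], float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for any metric(gold, pred) -> float."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping gold/pred pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def error_rate(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Stand-in metric for the example; any safety metric with this signature works."""
    return sum(g != p for g, p in zip(gold, pred)) / len(gold)
```

Because two systems are scored on the same 300 examples, resampling the per-example difference between them would be a more direct test than checking whether two separate intervals overlap.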
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify important gaps in methodological transparency that affect the interpretability of our safety metrics and baseline comparisons. We have revised the manuscript to address these points directly while preserving the original scope and data constraints of the study.
Point-by-point responses
- Referee: [§3.2 and §4.1] The 300-example human-calibrated gold set is small for reliable estimation of tail-sensitive safety metrics such as under-triage rate and severe under-triage rate. No class distribution, inter-annotator agreement, or calibration protocol is reported, which directly affects the trustworthiness of the safety conclusions that underwrite the central claim.
Authors: We agree that 300 examples is modest for stable estimation of low-frequency safety events and that the original submission should have included more detail on the evaluation set construction. In the revised manuscript we now report the class distribution of the gold set in §3.2, expand the description of the calibration protocol (including annotator instructions and review process), and add an explicit limitations paragraph discussing the implications of the modest sample size for tail metrics. Inter-annotator agreement was not computed because annotations were performed by a single domain expert with subsequent review; we have stated this limitation plainly rather than claiming agreement statistics. These additions support a more cautious reading of the safety results without altering our central conclusion that the approach is suitable only for selective human review. revision: partial
- Referee: [§3.1] The auto-labeling procedure used to create the 700-example silver training set is described at a high level only. Without details on the labeling model, threshold choices, or error analysis, it is impossible to rule out systematic biases that could disadvantage the supervised baselines and thereby exaggerate the relative advantage of the prompted LLMs.
Authors: We accept that the original high-level description limited readers' ability to assess potential label noise or bias. The revised §3.1 now supplies the concrete details of the auto-labeling process, including the model used, the prompting approach, any confidence thresholds applied, and a short error analysis performed on a held-out sample of silver labels. These additions make it possible to evaluate whether systematic biases exist and how they might affect the supervised baselines relative to the few-shot LLM results. We note that any such biases would apply equally to the TF-IDF and BioBERT models trained on the silver data, while the LLM evaluations rely on the human-calibrated few-shot pool. revision: yes
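The concern about tail metrics on a 300-example set can be made concrete with a Wilson score interval for a proportion. The sketch below is illustrative only: the gold set's class distribution is not given on this page, so any counts passed to the function are hypothetical rather than results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(events: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("need n > 0 and 0 <= events <= n")
    p = events / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: zero missed emergencies among 20 emergency cases.
lower, upper = wilson_interval(0, 20)
```

With these hypothetical counts the upper bound is z²/(n + z²), about 0.16, so even a run with no observed misses on a small class would remain compatible with a miss rate in the mid-teens percent.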
Circularity Check
No circularity in empirical evaluation
The paper reports a direct empirical comparison of few-shot prompted LLMs against TF-IDF and BioBERT baselines trained on silver labels and evaluated on a held-out human-calibrated gold set. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology. All reported metrics (macro-F1, emergency-recall, under-triage rates) are computed from the evaluation data rather than reduced to inputs by construction, so the central claims remain independent of any internal redefinition or load-bearing self-reference.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Auto-labeled silver set from HealthCareMagic-100K provides useful training signal for TF-IDF and BioBERT baselines
- domain assumption 300 human-calibrated gold examples form a reliable test set for macro-F1 and safety metrics
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. We evaluate with macro-F1 alongside safety-aware metrics, including emergency-recall, under-triage rate, and severe under-triage rate.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
The strongest LLM (Claude Haiku 4.5, 12-shot) reaches macro-F1 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] A zero-shot and few-shot study of instruction-finetuned large language models applied to clinical and biomedical tasks. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2049–2066, Torino, Italia. ELRA and ICCL. Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghye...
- [2] BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240. Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. 2023. ChatDoctor: A medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge. Cureus, 15(6):e40895. Lars Masann...
- [3] Emergency-enriched: emergency score ≥ 2, up to 1,200 candidates
- [4] Self-care: self-care keywords present and no self-care exclusion keywords, up to 800 candidates
- [5] Urgent: urgent keywords present, up to 500 candidates
- [6] Schedule: schedule, chronic-care, follow-up, or appointment-related keywords present, up to 500 candidates. Within each bucket, records were sorted by priority score before selection. Random seed 42 was used for stochastic operations. Keyword lists. The following keyword lists were used for emergency enrichment and bucket assignment. 13 EMERGENCY_KEYWO...
- [7] Label by the most severe signal
- [8] When uncertain, choose the more severe label
- [9] Lower threshold for vulnerable populations
- [10] Respond with JSON only. User prompt template. { "label": "<one of: self-care, schedule-visit, urgent-clinician-review, emergency-referral>", "confidence": "<high, medium, or low>", "reasoning": "<1-2 sentences explaining the triage decision>", "insufficient_info": <true or false>, "missing_info": "<what information is missing, or null>" } 17 C.9 Guide...
- [11] urgent-clinician-review
- [12] emergency-referral ### Definitions:: [self-care] Use for general health questions, mild short-duration symptoms, stable known conditions, medication/lifestyle questions, retrospective checks, or informational questions that do not require prompt clinician evaluation. [schedule-visit] Use when the message suggests the patient should have a non-urgent clini...
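The JSON response template quoted in the fragments above suggests that each model reply can be validated before it is scored. The following is a minimal sketch of such a check; the function name `parse_triage_response` and the decision to reject rather than repair malformed replies are assumptions, not details taken from the paper.

```python
import json
from typing import Any, Dict

# Field values taken from the quoted prompt template.
ALLOWED_LABELS = {"self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def parse_triage_response(raw: str) -> Dict[str, Any]:
    """Parse a model reply and check it against the template's fields.

    Raises ValueError (json.JSONDecodeError is a subclass) on any violation.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    confidence = data.get("confidence")
    if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
        raise ValueError(f"unexpected confidence: {confidence!r}")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    if not isinstance(data.get("insufficient_info"), bool):
        raise ValueError("insufficient_info must be true or false")
    missing = data.get("missing_info")
    if missing is not None and not isinstance(missing, str):
        raise ValueError("missing_info must be a string or null")
    return data
```

How rejected replies are counted matters for the safety metrics: treating an unparseable reply as a correct or low-severity prediction would understate under-triage, so sending such cases to human review is the more conservative choice.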