
arxiv: 2605.15680 · v1 · pith:QXFUZZOCnew · submitted 2026-05-15 · 💻 cs.CL · cs.LG · q-bio.QM

Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries

Pith reviewed 2026-05-20 19:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · q-bio.QM
keywords few-shot prompting, large language models, triage categorization, patient inquiries, healthcare text classification, macro-F1, safety metrics, BioBERT baseline

The pith

Few-shot prompted LLMs outperform BioBERT baselines for four-class triage of online patient inquiries but require human oversight.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can route informal, incomplete online patient inquiries to the right level of clinical follow-up using only a small number of examples. It creates a 300-example human-checked gold test set and a silver-labeled training set drawn from the HealthCareMagic-100K corpus, then pits six prompted LLMs against TF-IDF and BioBERT baselines trained on the silver data. A sympathetic reader would care because the volume of patient messages is growing and any reliable low-label method could ease clinician workload while flagging the urgent cases. The strongest result is Claude Haiku 4.5 in 12-shot mode reaching macro-F1 of 0.475 versus BioBERT's 0.378, with overlapping confidence intervals; few-shot prompting and model agreement improve some classes more than others. The authors conclude the models are useful for prioritization and selective review but not for fully autonomous decisions.

Core claim

We study this as a four-class actionable triage task -- self-care, schedule-visit, urgent-clinician-review, or emergency-referral -- and find that the strongest LLM (Claude Haiku 4.5, 12-shot) reaches macro-F1 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals. Few-shot prompting and two-model agreement help in label-dependent ways: self-care agreement is reliable, urgent-clinician-review is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.
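
The "overlapping confidence intervals" caveat can be made concrete with a small sketch. The code below computes macro-F1 over the four labels and a percentile bootstrap interval. The source does not say how the paper's intervals were computed, so the bootstrap is an assumption here, and the names macro_f1, bootstrap_ci, and intervals_overlap are hypothetical helpers, not the authors' code.

```python
import random
from typing import List, Sequence, Tuple

LABELS = ["self-care", "schedule-visit",
          "urgent-clinician-review", "emergency-referral"]


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1; a class with no support and no
    predictions is scored 0.0 (an assumed convention)."""
    scores: List[float] = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(gold: Sequence[str], pred: Sequence[str],
                 n_boot: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for macro-F1 (assumed method)."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_boot):
        # Resample example indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([gold[i] for i in idx],
                              [pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

As a usage example, the Figure 1 residue lists Claude Haiku 4.5 at [0.318, 0.430] for 0-shot and [0.413, 0.532] for 12-shot; intervals_overlap((0.318, 0.430), (0.413, 0.532)) returns True, so even the gain from few-shot prompting is not cleanly separated on a 300-example test set.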

What carries the argument

Four-class actionable triage task using few-shot prompting of LLMs on informal patient text, evaluated against silver-trained TF-IDF and BioBERT baselines with macro-F1 plus safety metrics of emergency-recall, under-triage rate, and severe under-triage rate.
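
The three safety metrics are named here but not defined on this page, so the following is a minimal sketch under stated assumptions: labels are ordered by severity from self-care up to emergency-referral, under-triage means the predicted label is less severe than the gold label, severe under-triage means it is at least two levels less severe, and both rates are taken over all examples. The paper's exact definitions may differ; safety_metrics and severe_gap are hypothetical names.

```python
from typing import Dict, Sequence

# Assumed severity ordering of the four triage labels.
SEVERITY = {
    "self-care": 0,
    "schedule-visit": 1,
    "urgent-clinician-review": 2,
    "emergency-referral": 3,
}


def safety_metrics(gold: Sequence[str], pred: Sequence[str],
                   severe_gap: int = 2) -> Dict[str, float]:
    """Emergency recall plus under-triage rates under the assumptions above."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equal length")
    # Positive gap: the model chose a less severe label than the gold label.
    gaps = [SEVERITY[g] - SEVERITY[p] for g, p in zip(gold, pred)]
    n = len(gold)
    emergency_preds = [p for g, p in zip(gold, pred)
                       if g == "emergency-referral"]
    if emergency_preds:
        recall = (sum(p == "emergency-referral" for p in emergency_preds)
                  / len(emergency_preds))
    else:
        recall = float("nan")  # undefined without gold emergencies
    return {
        "emergency_recall": recall,
        "under_triage_rate": sum(gap > 0 for gap in gaps) / n,
        "severe_under_triage_rate": sum(gap >= severe_gap for gap in gaps) / n,
    }


# Toy labels for illustration only; not data from the paper.
toy_gold = ["emergency-referral", "emergency-referral",
            "urgent-clinician-review", "schedule-visit", "self-care"]
toy_pred = ["emergency-referral", "urgent-clinician-review",
            "self-care", "schedule-visit", "schedule-visit"]
toy_metrics = safety_metrics(toy_gold, toy_pred)
```

On the toy lists, one of two emergencies is caught, two of five predictions are under-triaged, and one of those is two levels too low. Over-triage (the last toy example) does not count against any of the three metrics, which is why macro-F1 is reported alongside them.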

If this is right

  • LLMs under few-shot conditions can prioritize patient messages for clinician attention when labeled data is scarce.
  • Two-model agreement raises reliability for self-care decisions but leaves urgent-clinician-review decisions less stable.
  • Safety-aware metrics such as emergency-recall and under-triage rate must stay within acceptable bounds for any deployment.
  • The current performance level supports selective human review rather than replacing clinicians entirely.
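
The two-model agreement idea in the bullets above can be sketched as a routing rule: accept a label automatically only when both models agree and the label is one where agreement has proven reliable, and send everything else to a person. The page only reports that self-care agreement is reliable and urgent-clinician-review agreement is not; the trusted set, the "human-review" sentinel, and both function names below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Sequence


def agreement_precision(gold: Sequence[str], pred_a: Sequence[str],
                        pred_b: Sequence[str]) -> Dict[str, float]:
    """For each label, the fraction of cases where both models agreed on
    that label and the gold label matched."""
    agree: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for g, a, b in zip(gold, pred_a, pred_b):
        if a == b:
            agree[a] += 1
            correct[a] += int(g == a)
    return {lab: correct[lab] / agree[lab] for lab in agree}


def route(pred_a: str, pred_b: str,
          trusted: FrozenSet[str] = frozenset({"self-care"})) -> str:
    """Auto-accept only agreed labels in the trusted set; otherwise
    return the hypothetical sentinel 'human-review'."""
    if pred_a == pred_b and pred_a in trusted:
        return pred_a
    return "human-review"
```

The trusted set would be chosen from agreement_precision measured on held-out labeled data, and agreed urgent or emergency labels would still reach a clinician under this rule.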

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such triage models could be inserted as an initial filter in patient portals to reduce the number of messages a clinician sees first.
  • Testing the same prompting setup on live streams of messages from multiple clinics would reveal how well the silver-label training generalizes.
  • Adding per-prediction confidence thresholds might let the system route only the lowest-confidence cases to humans automatically.
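
The confidence-threshold extension in the last bullet could look like the sketch below. It assumes each prediction carries a numeric confidence in [0, 1], which is a hypothetical field, and the 0.7 threshold is an arbitrary placeholder that would need tuning against under-triage rates on labeled data.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Prediction:
    message_id: str
    label: str
    confidence: float  # hypothetical score in [0, 1]


def split_by_confidence(
    preds: Sequence[Prediction], threshold: float = 0.7
) -> Tuple[List[Prediction], List[Prediction]]:
    """Return (auto_handled, needs_review); anything below the threshold
    goes to human review."""
    auto: List[Prediction] = []
    review: List[Prediction] = []
    for p in preds:
        (review if p.confidence < threshold else auto).append(p)
    return auto, review
```

A threshold like this is only as good as the model's calibration, so it complements rather than replaces the safety metrics discussed above.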

Load-bearing premise

The 300-example human-calibrated gold evaluation set and the auto-labeled silver training set from HealthCareMagic-100K are sufficiently representative and accurate to support reliable comparisons and safety conclusions for real-world patient inquiries.

What would settle it

A new collection of several hundred independently expert-labeled real-world patient inquiries where LLM triage outputs are compared directly to clinician decisions and actual clinical outcomes would show whether the macro-F1 edge and under-triage rates remain acceptable.

Figures

Figures reproduced from arXiv: 2605.15680 by Jiafu Li, Liqi Zhou.

Figure 1
Figure 1: Trade-off between macro-F1 (x-axis) and under-triage rate (y-axis) across model configurations. Few-shot LLMs occupy the upper-left region: higher macro-F1 and lower under-triage than supervised baselines.

Model | 0-shot | 4-shot | 12-shot | Best setting | ∆ best vs. 0-shot
Claude Haiku 4.5 | 0.374 [0.318, 0.430] | 0.422 [0.365, 0.479] | 0.475 [0.413, 0.532] | 12-shot | +0.101
Llama3.1-8B | 0.375 [0.323, 0.426] | 0.351 [0.295, 0.…
Figure 2
Figure 2: Shared base prompt used for the 0-, 4-, and 12-shot conditions.
Figure 3
Figure 3: Four labeled examples appended to the shared base prompt for the 4-shot condition. We include one …
Figure 4
Figure 4: Eight additional labeled examples appended on top of the 4-shot to construct the 12-shot. Together with …
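
The captions for Figures 2 to 4 describe a nested construction: one shared base prompt, four labeled examples for the 4-shot condition, and eight more on top for the 12-shot condition. A minimal sketch of that assembly follows. The "Inquiry:" / "Label:" layout and the function name are hypothetical, since the actual prompt text is not reproduced on this page.

```python
from typing import List, Sequence, Tuple

LABELS = ["self-care", "schedule-visit",
          "urgent-clinician-review", "emergency-referral"]


def build_prompt(base_prompt: str, examples: Sequence[Tuple[str, str]],
                 n_shots: int, inquiry: str) -> str:
    """Append the first n_shots (text, label) pairs to the base prompt,
    then the new inquiry. Because the first four examples are reused, the
    12-shot prompt is a superset of the 4-shot prompt."""
    if n_shots not in (0, 4, 12):
        raise ValueError("n_shots must be 0, 4, or 12")
    if n_shots > len(examples):
        raise ValueError("not enough labeled examples for this condition")
    parts: List[str] = [base_prompt]
    for text, label in examples[:n_shots]:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        # Hypothetical example layout; the paper's format may differ.
        parts.append(f"Inquiry: {text}\nLabel: {label}")
    parts.append(f"Inquiry: {inquiry}\nLabel:")
    return "\n\n".join(parts)
```

Keeping the base prompt fixed across conditions means any change in macro-F1 between 0-, 4-, and 12-shot runs can be attributed to the added examples rather than to prompt wording.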
Original abstract

Online patient inquiries are often informal, incomplete, and written before professional assessment, yet they must still be routed to an appropriate level of clinical follow-up. We study this as a four-class actionable triage task -- self-care, schedule-visit, urgent-clinician-review, or emergency-referral -- and ask whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. Accordingly, we evaluate with macro-$F_1$ alongside safety-aware metrics, including emergency-recall, under-triage rate, and severe under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) reaches macro-$F_1$ 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals. Few-shot prompting and two-model agreement help in label-dependent ways: self-care agreement is reliable, urgent-clinician-review is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript empirically compares few-shot prompted LLMs (including Claude Haiku 4.5) against TF-IDF and BioBERT baselines for four-class actionable triage of online patient inquiries (self-care, schedule-visit, urgent-clinician-review, emergency-referral). Using HealthCareMagic-100K, the authors construct a 300-example human-calibrated gold evaluation set and a 700-example auto-labeled silver training set. Baselines are trained on silver labels; LLMs are evaluated in 0/4/12-shot regimes. Primary metrics are macro-F1 plus safety-aware quantities (emergency-recall, under-triage rate, severe under-triage rate). The strongest LLM reaches macro-F1 0.475 (vs. BioBERT 0.378) on point estimate with overlapping confidence intervals. The authors conclude that LLMs can support prioritization and selective human review but not autonomous deployment.

Significance. If the findings hold, the work supplies concrete evidence that few-shot LLMs can outperform supervised models trained on noisy silver labels in a safety-sensitive medical routing task. Explicit reporting of overlapping confidence intervals and inclusion of under-triage metrics are positive features that support a measured interpretation. The study contributes to low-resource healthcare NLP by showing practical limits of current prompting approaches. Credit is due for the safety-focused evaluation design and the explicit caveat against autonomous use.

major comments (2)
  1. [§3.2 and §4.1] The 300-example human-calibrated gold set is small for reliable estimation of tail-sensitive safety metrics such as under-triage rate and severe under-triage rate. No class distribution, inter-annotator agreement, or calibration protocol is reported, which directly affects the trustworthiness of the safety conclusions that underwrite the central claim.
  2. [§3.1] The auto-labeling procedure used to create the 700-example silver training set is described at a high level only. Without details on the labeling model, threshold choices, or error analysis, it is impossible to rule out systematic biases that could disadvantage the supervised baselines and thereby exaggerate the relative advantage of the prompted LLMs.
minor comments (3)
  1. [Abstract] Abstract and §4: The notation 'macro-$F_1$' should be expanded to 'macro-averaged F1' on first use for clarity.
  2. [Results] Table 2 (or equivalent results table): Confidence intervals are mentioned in text but not shown for all safety metrics; adding them would improve interpretability of the overlapping-CI statement.
  3. [§2] A brief reference to prior work on medical triage datasets or LLM safety evaluation would help situate the contribution.
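
The first major comment, that 300 examples is small for tail-sensitive metrics, can be illustrated with a Wilson score interval for a proportion. The counts below are hypothetical: the report notes that no class distribution is given, so 30 emergency cases with 24 recalled is an invented figure used only to show how wide the interval gets. wilson_interval is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 24 of 30 gold emergencies recalled (not from the paper).
small_lo, small_hi = wilson_interval(24, 30)
# Same observed rate with ten times as many cases, for comparison.
large_lo, large_hi = wilson_interval(240, 300)
```

With 24 of 30 the interval spans roughly 0.63 to 0.90, while the same 80% rate on 300 cases gives a much narrower range. A class that makes up a small share of a 300-example gold set therefore supports only coarse statements about emergency recall or severe under-triage.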

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important gaps in methodological transparency that affect the interpretability of our safety metrics and baseline comparisons. We have revised the manuscript to address these points directly while preserving the original scope and data constraints of the study.

  1. Referee: [§3.2 and §4.1] The 300-example human-calibrated gold set is small for reliable estimation of tail-sensitive safety metrics such as under-triage rate and severe under-triage rate. No class distribution, inter-annotator agreement, or calibration protocol is reported, which directly affects the trustworthiness of the safety conclusions that underwrite the central claim.

    Authors: We agree that 300 examples is modest for stable estimation of low-frequency safety events and that the original submission should have included more detail on the evaluation set construction. In the revised manuscript we now report the class distribution of the gold set in §3.2, expand the description of the calibration protocol (including annotator instructions and review process), and add an explicit limitations paragraph discussing the implications of the modest sample size for tail metrics. Inter-annotator agreement was not computed because annotations were performed by a single domain expert with subsequent review; we have stated this limitation plainly rather than claiming agreement statistics. These additions support a more cautious reading of the safety results without altering our central conclusion that the approach is suitable only for selective human review. revision: partial

  2. Referee: [§3.1] The auto-labeling procedure used to create the 700-example silver training set is described at a high level only. Without details on the labeling model, threshold choices, or error analysis, it is impossible to rule out systematic biases that could disadvantage the supervised baselines and thereby exaggerate the relative advantage of the prompted LLMs.

    Authors: We accept that the original high-level description limited readers' ability to assess potential label noise or bias. The revised §3.1 now supplies the concrete details of the auto-labeling process, including the model used, the prompting approach, any confidence thresholds applied, and a short error analysis performed on a held-out sample of silver labels. These additions make it possible to evaluate whether systematic biases exist and how they might affect the supervised baselines relative to the few-shot LLM results. We note that any such biases would apply equally to the TF-IDF and BioBERT models trained on the silver data, while the LLM evaluations rely on the human-calibrated few-shot pool. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical evaluation


The paper reports a direct empirical comparison of few-shot prompted LLMs against TF-IDF and BioBERT baselines trained on silver labels and evaluated on a held-out human-calibrated gold set. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology. All reported metrics (macro-F1, emergency-recall, under-triage rates) are computed from the evaluation data rather than reduced to inputs by construction, so the central claims remain independent of any internal redefinition or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on assumptions about the quality and representativeness of auto-generated silver labels and the small human gold set; no new entities or fitted parameters beyond standard model choices are introduced.

axioms (2)
  • domain assumption Auto-labeled silver set from HealthCareMagic-100K provides useful training signal for TF-IDF and BioBERT baselines
    Used to train supervised baselines for comparison against prompted LLMs
  • domain assumption 300 human-calibrated gold examples form a reliable test set for macro-F1 and safety metrics
    Constructed as evaluation benchmark for all methods

pith-pipeline@v0.9.0 · 5818 in / 1359 out tokens · 42093 ms · 2026-05-20T19:23:44.003104+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation between the paper passage and the cited Recognition theorem: unclear

    We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. Accordingly, we evaluate with macro-F1 alongside safety-aware metrics, including emergency-recall, under-triage rate, and severe under-triage rate.

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation between the paper passage and the cited Recognition theorem: unclear

    The strongest LLM (Claude Haiku 4.5, 12-shot) reaches macro-F1 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
