pith. machine review for the scientific record.

arxiv: 2604.26766 · v1 · submitted 2026-04-29 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

Domain-Adapted Small Language Models for Reliable Clinical Triage

Authors on Pith no claims yet

Pith reviewed 2026-05-07 13:20 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords clinical triage · small language models · domain adaptation · Emergency Severity Index · fine-tuning · pediatric triage · decision support · mistriage

The pith

Qwen2.5-7B models fine-tuned on pediatric triage data cut discordance and serious errors below the levels of GPT-4o and other small models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether open-source small language models can deliver accurate and consistent Emergency Severity Index assignments despite the variability of free-text triage notes that often leads to mistriage. It identifies clinical vignettes as the most effective prompting format and shows that the base Qwen2.5-7B model already offers a strong accuracy-efficiency balance among small models. Large-scale domain adaptation on expert-curated and silver-standard pediatric data then produces further gains that lower both overall discordance and clinically significant errors. These improvements exceed the results from all other tested small models and from proprietary large models such as GPT-4o, pointing to a workable path for private, institution-specific clinical decision support.
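
As a concrete illustration (not the paper's exact prompt, which the review does not reproduce), a vignette-style ESI query to a locally served Qwen2.5-7B-Instruct model might look like the sketch below; the prompt wording, checkpoint name, and decoding settings are assumptions.

    # Illustrative sketch: vignette-style ESI prompting with a local Qwen2.5-7B model.
    # The prompt wording is hypothetical; the paper's six prompting pipelines are not reproduced here.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    def esi_prompt(vignette: str) -> str:
        return (
            "You are an emergency department triage assistant.\n"
            f"Clinical vignette: {vignette}\n"
            "Assign an Emergency Severity Index (ESI) level from 1 (most urgent) "
            "to 5 (least urgent). Answer with a single digit."
        )

    vignette = "4-year-old with 2 days of fever, mild dehydration, alert, stable vital signs."
    messages = [{"role": "user", "content": esi_prompt(vignette)}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=5, do_sample=False)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))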

Core claim

After systematic comparison across prompting strategies, the Qwen2.5-7B model emerges as the strongest baseline small language model. Large-scale fine-tuning on expert-curated and silver-standard pediatric triage datasets then yields models that achieve substantially lower discordance rates and fewer clinically significant errors than all baseline small language models and than advanced proprietary large language models including GPT-4o, while preserving computational efficiency.

What carries the argument

Domain adaptation of the Qwen2.5-7B small language model through fine-tuning on expert-curated and silver-standard pediatric triage data, which aligns model outputs more closely with expert Emergency Severity Index standards.
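
The Figure 1 caption names QLoRA fine-tuning of Qwen2.5-7B as the adaptation mechanism. A minimal sketch of such a run with Hugging Face transformers, peft, and trl follows; the hyperparameters, dataset file, and field layout are assumptions, not the paper's reported configuration.

    # Minimal QLoRA fine-tuning sketch for Qwen2.5-7B. Hyperparameters and the dataset
    # file are illustrative assumptions; the paper's configuration is not restated here.
    import torch
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from trl import SFTConfig, SFTTrainer

    MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"

    # 4-bit NF4 quantization so the 7B model fits on a single commodity GPU.
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, quantization_config=bnb, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

    # Hypothetical JSONL file: one record per vignette, with a "text" field that joins
    # the vignette prompt and the expert- or silver-labeled ESI level.
    train = load_dataset("json", data_files="triage_vignettes.jsonl", split="train")

    trainer = SFTTrainer(
        model=model,
        train_dataset=train,
        peft_config=lora,
        args=SFTConfig(output_dir="qwen2.5-7b-esi-qlora", num_train_epochs=3,
                       per_device_train_batch_size=4, learning_rate=2e-4),
    )
    trainer.train()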

If this is right

  • Targeted fine-tuning on domain-specific data produces larger accuracy gains than more elaborate prompting strategies alone.
  • Small language models can function as privacy-preserving alternatives to proprietary large models for clinical triage support.
  • Institution-specific adaptation of open-source models reduces risks of mistriage and workflow inefficiencies.
  • The approach demonstrates that smaller models can reach or exceed the reliability of much larger systems when adapted to the target clinical task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-tuning recipe could be applied to other variable free-text clinical tasks such as discharge summarization or risk scoring.
  • If the observed gains persist on adult data and in prospective deployment, emergency departments could integrate such models directly into electronic health records for real-time assistance.
  • Testing the adapted models on triage notes from multiple institutions and age groups would clarify whether pediatric-focused training introduces unintended biases.
  • Hospitals could combine the fine-tuned small model with human review protocols to create hybrid triage workflows that minimize both errors and external data sharing.

Load-bearing premise

The expert-curated and silver-standard pediatric triage data used for fine-tuning is representative of real-world adult and pediatric triage documentation, and performance gains generalize to live clinical settings without overfitting to the evaluation vignettes.

What would settle it

Applying the fine-tuned model to a large set of real-world adult triage notes from live emergency department records and observing whether rates of clinically significant errors and discordance remain lower than those produced by GPT-4o.
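
To make that test concrete, the sketch below computes the rates in question from paired expert and model ESI levels. The definitions are assumptions for illustration (discordance = any mismatch, clinically significant error = a gap of two or more levels), since the review does not restate the paper's exact thresholds.

    # Assumed metric definitions for illustration: discordance = any mismatch with the
    # expert ESI; under-/over-triage = direction of the mismatch (ESI 1 is most urgent,
    # 5 least urgent); clinically significant error = gap of >= 2 levels.
    from dataclasses import dataclass

    @dataclass
    class TriageMetrics:
        discordance: float
        under_triage: float
        over_triage: float
        clinically_significant: float

    def score(predicted: list[int], expert: list[int], severe_gap: int = 2) -> TriageMetrics:
        n = len(expert)
        diffs = [p - e for p, e in zip(predicted, expert)]
        return TriageMetrics(
            discordance=sum(d != 0 for d in diffs) / n,
            under_triage=sum(d > 0 for d in diffs) / n,  # predicted less urgent than expert
            over_triage=sum(d < 0 for d in diffs) / n,   # predicted more urgent than expert
            clinically_significant=sum(abs(d) >= severe_gap for d in diffs) / n,
        )

    # Example: five encounters, expert levels vs. model predictions.
    print(score(predicted=[3, 2, 4, 3, 1], expert=[3, 3, 2, 3, 1]))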

Figures

Figures reproduced from arXiv: 2604.26766 by Brandon Ho, Dennis Ren, Kenneth McKinley, Manar Aljohani, Xuan Wang.

Figure 1
Figure 1. Overview of the methodological pipeline, illustrating data and training sources, silver-standard vignette generation, six prompting pipelines, QLoRA fine-tuning of Qwen2.5-7B, evaluation metrics, explainability analysis, and additional multi-agent and RAG experiments.
Figure 2
Figure 2. Fine-tuning comparisons for Qwen2.5 models on ESI prediction from clinical vignettes. Top row: Qwen2.5-7B; bottom row: Qwen2.5-14B-Instruct. X-axis settings are discrete: Base = unfine-tuned; ESI = fine-tuned on synthetic ESI Handbook vignettes; 2k, 5k, 10k = fine-tuned on 2,000, 5,000, and 10,000 CNH silver examples; C1–C10 = sequential chunk-based fine-tuning stages (a sketch of such a staged loop follows this figure list). Panels show discordance, under-/ove…
Figure 3
Figure 3. Token-level attention patterns for correct vs. incorrect ESI predictions across ESI-2 and ESI-3.
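
Figure 2 describes sequential chunk-based fine-tuning stages (C1–C10) on CNH silver examples. A minimal sketch of such a staged loop is below, assuming each stage simply continues training the model produced by the previous stage on the next data chunk; the file names, chunking, and per-stage settings are illustrative assumptions.

    # Sketch of sequential chunk-based fine-tuning: each stage resumes from the model
    # produced by the previous stage and trains on the next chunk. Chunk file names and
    # per-stage hyperparameters are hypothetical.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    def sequential_chunk_finetune(model, chunk_files, output_root="qwen2.5-7b-esi"):
        for stage, path in enumerate(chunk_files, start=1):
            chunk = load_dataset("json", data_files=path, split="train")
            trainer = SFTTrainer(
                model=model,
                train_dataset=chunk,
                args=SFTConfig(output_dir=f"{output_root}-C{stage}", num_train_epochs=1),
            )
            trainer.train()
            model = trainer.model  # carry the updated weights into the next stage
        return model

    # e.g. sequential_chunk_finetune(model, [f"cnh_chunk_{i}.jsonl" for i in range(1, 11)])
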
read the original abstract

Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. This study evaluates whether open-source small language models (SLMs) can serve as reliable, privacy-preserving decision-support tools for clinical triage. We systematically compared multiple SLMs across diverse prompting pipelines and found that clinical vignettes, concise summaries of triage narratives, yielded the most accurate predictions. The SLM, Qwen2.5-7B, demonstrated the strongest balance of accuracy, stability, and computational efficiency. Through large-scale domain adaptation using expert-curated and silver-standard pediatric triage data, fine-tuned Qwen2.5-7B models substantially reduced discordance and clinically significant errors, outperforming all baseline SLMs and advanced proprietary large language models (LLMs, e.g., GPT-4o). These findings highlight the feasibility of institution-specific SLMs for reliable, privacy-preserving ESI decision support and underscore the importance of targeted fine-tuning over more complex inference strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that domain-adapted small language models can provide reliable clinical triage support. Specifically, after identifying Qwen2.5-7B as the best-performing SLM for ESI prediction using clinical vignettes, the authors fine-tune it on large-scale expert-curated and silver-standard pediatric triage data. The resulting models exhibit lower discordance and fewer clinically significant errors than baseline SLMs and proprietary LLMs including GPT-4o, supporting the feasibility of institution-specific, privacy-preserving SLMs for ESI decision support.

Significance. Should the findings prove robust, the significance is considerable for applied clinical NLP. Demonstrating that fine-tuned 7B-parameter open models can surpass GPT-4o in a high-stakes task like triage underscores the power of domain adaptation. This could encourage more institutions to develop local models rather than relying on cloud-based LLMs, with benefits for data privacy and customization. The work also contributes to understanding when fine-tuning is preferable to prompting strategies.

major comments (2)
  1. [Abstract] The central claim that fine-tuned Qwen2.5-7B models 'substantially reduced discordance and clinically significant errors' while outperforming GPT-4o rests on pediatric data. No evidence is provided for generalization to adult triage or live ED settings, where documentation styles and patient demographics differ markedly. This is load-bearing for the paper's title and conclusions regarding 'reliable clinical triage'.
  2. [Methods] Insufficient detail is given on the construction of the silver-standard dataset and the fine-tuning hyperparameters. Without this, it is difficult to assess whether the performance gains are due to genuine domain adaptation or artifacts of the data generation process.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief mention of the dataset sizes or the magnitude of improvement (e.g., percentage reduction in errors) to give readers a sense of scale.
  2. Ensure that all acronyms (e.g., ESI, SLM, LLM) are defined at first use in the main text.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating revisions made to the manuscript where applicable.

read point-by-point responses
  1. Referee: [Abstract] The central claim that fine-tuned Qwen2.5-7B models 'substantially reduced discordance and clinically significant errors' while outperforming GPT-4o rests on pediatric data. No evidence is provided for generalization to adult triage or live ED settings, where documentation styles and patient demographics differ markedly. This is load-bearing for the paper's title and conclusions regarding 'reliable clinical triage'.

    Authors: We acknowledge that the empirical results are derived from pediatric triage vignettes and datasets, as stated in the abstract and methods. The title and conclusions frame the work as demonstrating the feasibility of domain-adapted SLMs for clinical triage, with the pediatric setting serving as a concrete, high-stakes test case. In the revised manuscript, we have updated the abstract, introduction, and discussion to explicitly qualify the scope as pediatric ED triage and to note that extension to adult populations or prospective live-ED deployment would require additional validation studies. This preserves the core finding that targeted fine-tuning yields measurable improvements over baselines and larger models in the evaluated domain. revision: partial

  2. Referee: [Methods] Insufficient detail is given on the construction of the silver-standard dataset and the fine-tuning hyperparameters. Without this, it is difficult to assess whether the performance gains are due to genuine domain adaptation or artifacts of the data generation process.

    Authors: We agree that additional methodological transparency is warranted. The revised manuscript now includes an expanded Methods section detailing the silver-standard dataset construction (including the rule-based and model-assisted labeling procedures, validation against expert-curated subsets, and any filtering steps applied) as well as the complete fine-tuning configuration (learning rate schedule, number of epochs, batch size, optimizer, and any regularization techniques). These additions allow readers to evaluate the domain-adaptation process directly. revision: yes

standing simulated objections not resolved
  • Direct empirical evidence for generalization to adult triage or live ED operational settings cannot be provided from the current study, as it would require new data collection and prospective evaluation outside the manuscript's scope.

Circularity Check

0 steps flagged

No circularity; empirical performance claims rest on held-out evaluations

full rationale

The paper reports an empirical comparison of SLMs on ESI triage prediction, followed by domain-adaptive fine-tuning on expert-curated and silver-standard pediatric data, with reported gains measured against baselines and GPT-4o on clinical vignettes. No equations, theoretical derivations, or load-bearing self-citations appear in the provided abstract or described claims. Performance improvements are presented as measured outcomes rather than presupposed by the method definition, and the central result (reduced discordance after fine-tuning) is not equivalent to the input data or prompting choices by construction. Generalization concerns raised by the skeptic are external-validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all claims rest on empirical evaluation whose details are unavailable.

pith-pipeline@v0.9.0 · 5487 in / 1083 out tokens · 47195 ms · 2026-05-07T13:20:55.769016+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 1 canonical work page

  1. [1]

    Emergency Severity Index (ESI): A Triage Tool for Emergency Department Care

    Gilboy N, Tanabe T, Travers D, Rosenau A. Emergency Severity Index (ESI): A Triage Tool for Emergency Department Care. Agency for Healthcare Research and Quality (AHRQ); 2020

  2. [2]

    Machine learning models for emergency department triage: a systematic review

    Van der Linden C, et al. Machine learning models for emergency department triage: a systematic review. Emergency Medicine Journal. 2021;38(9):679-85

  3. [3]

    Emergency severity index version 4 and triage of pediatric emergency department patients

    Sax DR, Warton EM, Kene MV, Ballard DW, Vitale TJ, Timm JA, et al. Emergency severity index version 4 and triage of pediatric emergency department patients. JAMA Pediatrics. 2024;178(10):1027-34

  4. [4]

    Evaluation of Version 4 of the emergency severity index in us emergency departments for the rate of mistriage

    Sax DR, Warton EM, Mark DG, Vinson DR, Kene MV, Ballard DW, et al. Evaluation of Version 4 of the emergency severity index in US emergency departments for the rate of mistriage. JAMA Network Open. 2023;6(3):e233404-4

  5. [5]

    Accuracy and reliability of emergency department triage using the emergency severity index: an international multicenter assessment

    Mistry B, De Ramirez SS, Kelen G, Schmitz PS, Balhara KS, Levin S, et al. Accuracy and reliability of emergency department triage using the emergency severity index: an international multicenter assessment. Annals of Emergency Medicine. 2018;71(5):581-7

  6. [6]

    Learning to Diagnose with LSTM Recurrent Neural Networks

    Lipton ZC, Kale DC, Elkan C, Wetzel R. Learning to Diagnose with LSTM Recurrent Neural Networks. In: ICLR; 2015

  7. [7]

    Publicly Available Clinical BERT Embeddings

    Alsentzer E, Murphy J, Boag W, Weng WH, Jin D, Naumann T, et al. Publicly Available Clinical BERT Embeddings. In: Clinical NLP Workshop; 2019

  8. [8]

    Language Models are Few-Shot Learners

    Brown TB, Mann B, Ryder N, Subbiah M, et al. Language Models are Few-Shot Learners. NeurIPS. 2020

  9. [9]

    Large Language Models Encode Clinical Knowledge

    Singhal K, et al. Large Language Models Encode Clinical Knowledge. Nature. 2023;620:545-52

  10. [10]

    Capabilities of GPT-4 in medical and clinical reasoning

    Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 in medical and clinical reasoning. npj Digital Medicine. 2023;6(1):194

  11. [11]

    TriageAgent: Towards Better Multi-Agents Collaborations for Large Language Model-Based Clinical Triage

    Lu M, Ho B, Ren D, Wang X. TriageAgent: Towards Better Multi-Agents Collaborations for Large Language Model-Based Clinical Triage. In: Al-Onaizan Y, Bansal M, Chen YN, editors. Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics; p. 5747-64. Available from: https://aclanthology.org/2024.findings-emnlp.329/

  13. [13]

    Med-LLM: Towards Open Foundation Models for Healthcare

    Wu C, et al. Med-LLM: Towards Open Foundation Models for Healthcare. arXiv preprint arXiv:2309.12058. 2023

  14. [14]

    A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare

    Aljohani M, Hou J, Kommu S, Wang X. A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare. In: Christodoulopoulos C, Chakraborty T, Rose C, Peng V, editors. Findings of the Association for Computational Linguistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics; p. 6720-48. Available from: https://aclanthology.org/2025.findings-emnlp.356/

  16. [16]

    Large language models in medicine: current applications and future prospects

    Thirunavukarasu A, Ting DS, et al. Large language models in medicine: current applications and future prospects. Nature Medicine. 2023;29:1936-45

  17. [17]

    Evaluation of Generative Artificial Intelligence Models in Predicting Pediatric Emergency Severity Index Levels

    Ho B, Lu M, Wang X, Butler R, Park J, Ren D. Evaluation of Generative Artificial Intelligence Models in Predicting Pediatric Emergency Severity Index Levels. Pediatric Emergency Care. 2024:10-1097

  18. [18]

    Small language models: Survey, measurements, and insights. arXiv preprint arXiv:2409.15790, 2024

    Team O. TinyLLM: Efficient Small Language Models for On-Device Inference; 2024. arXiv preprint arXiv:2409.15790

  19. [19]

    QLoRA: Efficient finetuning of quantized LLMs

    Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems. 2023;36:10088-115

  20. [20]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu EJ, et al. LoRA: Low-Rank Adaptation of Large Language Models. In: International Conference on Learning Representations (ICLR); 2022

  21. [21]

    Assessing the Utility of ChatGPT for Medical Note Summarization and Clinical Decision Support

    Jin Q, et al. Assessing the Utility of ChatGPT for Medical Note Summarization and Clinical Decision Support. NPJ Digital Medicine. 2023;6:65