Benchmarking and Adapting On-Device LLMs for Clinical Decision Support
Pith reviewed 2026-05-16 20:53 UTC · model grok-4.3
The pith
Fine-tuned on-device LLMs reach 87.9% accuracy on clinical diagnosis, approaching GPT-5.1 at 89.4%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On-device LLMs perform comparably to or better than DeepSeek-R1 and GPT-5-mini across the three tasks despite being substantially smaller. Fine-tuning Qwen3.5-35B on general diagnostic data yields 87.9% accuracy on general diagnosis, 1.5 points below GPT-5.1 at 89.4%, while the base Gemma 4 31B model reaches 86.5%. Most diagnostic errors (87.2%, pooled across all models) are clinically plausible differentials rather than off-topic predictions, and an upper-bound analysis indicates that up to 93.2% accuracy is attainable with improved answer selection.
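One way to read the upper-bound figure: if a model is sampled several times per case, majority voting counts a case as correct only when the most frequent answer matches the reference, while an oracle selector counts it as correct whenever any sampled answer matches. The sketch below contrasts the two. It is a minimal illustration, not the paper's procedure; the function names and toy cases are hypothetical, and the idea that the 93.2% bound was computed this way is an assumption the abstract does not confirm.

```python
from collections import Counter

def majority_vote(samples):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]

def selection_accuracies(cases):
    """cases: list of (sampled_answers, reference) pairs.

    Returns (majority_vote_accuracy, oracle_upper_bound).
    """
    voted = sum(majority_vote(samples) == ref for samples, ref in cases)
    # Oracle: a case counts if the reference appears anywhere in the samples.
    oracle = sum(ref in samples for samples, ref in cases)
    n = len(cases)
    return voted / n, oracle / n

# Toy data (hypothetical), three sampled answers per case.
toy_cases = [
    (["pneumonia", "pneumonia", "bronchitis"], "pneumonia"),    # vote correct
    (["glaucoma", "uveitis", "uveitis"], "glaucoma"),           # only oracle correct
    (["sarcoidosis", "lymphoma", "lymphoma"], "tuberculosis"),  # both wrong
    (["stroke", "stroke", "stroke"], "stroke"),                 # vote correct
]
voted_acc, oracle_acc = selection_accuracies(toy_cases)
print(f"majority vote: {voted_acc:.2f}, oracle upper bound: {oracle_acc:.2f}")
```

The gap between the two numbers is the headroom that better answer selection could recover without changing the underlying model.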
What carries the argument
Benchmarking of on-device LLMs across general diagnosis, specialty ophthalmology tasks, and expert simulation, combined with targeted fine-tuning of models such as Qwen3.5-35B on diagnostic data to close the gap with proprietary systems.
If this is right
- On-device models can supply accurate diagnostic support without uploading patient data to external servers.
- Fine-tuning improves diagnostic accuracy, bringing Qwen3.5-35B to 87.9%, within 1.5 points of GPT-5.1 (89.4%).
- The bulk of remaining errors are plausible clinical differentials rather than nonsensical predictions.
- Local inference removes cloud latency and cost barriers for routine clinical integration.
Where Pith is reading between the lines
- Mobile or edge deployment could extend specialist-level decision support to clinics lacking reliable internet.
- Further fine-tuning on narrower specialty datasets would likely expand coverage to additional medical domains.
- Reduced dependence on proprietary APIs could lower costs and increase accessibility for smaller healthcare providers.
Load-bearing premise
The three chosen clinical tasks and evaluation metrics accurately reflect real-world clinical decision support needs, and the gains from fine-tuning generalize beyond the tested sets.
What would settle it
Deployment of the fine-tuned models in a live hospital setting where their diagnostic outputs are compared directly against board-certified physicians on new, unseen patient cases.
Original abstract
Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often have large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark on-device LLMs from the gpt-oss (20b, 120b), Qwen3.5 (9B, 27B, 35B), and Gemma 4 (31B) families across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5.1, GPT-5-mini, and Gemini 3.1 Pro) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b and Qwen3.5-35B on general diagnostic data. Across tasks, on-device models achieve performance comparable to or exceeding DeepSeek-R1 and GPT-5-mini despite being substantially smaller. In addition, fine-tuning remarkably improves diagnostic accuracy, with the fine-tuned Qwen3.5-35B reaching 87.9% and approaching the proprietary GPT-5.1 (89.4%). Among base on-device models, Gemma 4 31B achieved the strongest general diagnostic accuracy at 86.5%, exceeding GPT-5-mini and approaching the fine-tuned Qwen3.5-35B. Error characterization revealed that 87.2% of diagnostic errors across all models were clinically plausible differentials rather than off-topic predictions, and upper-bound analysis showed up to 93.2% attainable accuracy through improved answer selection. These findings highlight the potential of on-device LLMs to deliver accurate, adaptable, and privacy-preserving clinical decision support, offering a practical pathway for broader integration of LLMs into routine clinical practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks on-device LLMs (gpt-oss 20b/120b, Qwen3.5 9B/27B/35B, Gemma 4 31B) against proprietary models (GPT-5.1, GPT-5-mini, Gemini 3.1 Pro) and DeepSeek-R1 on three clinical tasks: general disease diagnosis, ophthalmology diagnosis/management, and expert grading simulation. It reports that base models like Gemma 4 31B reach 86.5% on general diagnosis, and fine-tuning Qwen3.5-35B on general diagnostic data yields 87.9% accuracy, approaching GPT-5.1 at 89.4%. Error analysis shows 87.2% of errors are clinically plausible differentials, with an upper bound of 93.2% via improved selection.
Significance. If the fine-tuning gains and task results hold after verification of data integrity, the work shows that smaller on-device models can approach proprietary performance in clinical decision support while enabling local, privacy-preserving inference. This has clear practical value for resource-limited settings and demonstrates adaptability of open models via targeted fine-tuning.
major comments (3)
- [Abstract (and Methods)] The abstract provides no details on the provenance, size, train/test splits, or decontamination steps for the 'general diagnostic data' used to fine-tune Qwen3.5-35B and gpt-oss-20b. This is load-bearing for the central claim, as unverified overlap with the three evaluation tasks could explain the 87.9% result via leakage rather than genuine adaptation.
- [Results] Reported accuracies such as Gemma 4 31B at 86.5% and the fine-tuned Qwen3.5-35B at 87.9% lack sample sizes, variance estimates, confidence intervals, or statistical tests comparing to baselines like GPT-5-mini. This prevents assessment of whether performance orderings are stable.
- [Evaluation and Tasks] Task construction details (e.g., exact prompt formats, answer selection criteria, and how the three tasks map to real clinical workflows) are insufficient to evaluate whether the metrics (presumably exact-match or top-1) are clinically meaningful or generalizable.
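The leakage concern in the first major comment can be made concrete with an overlap check between fine-tuning and evaluation cases. The sketch below flags evaluation items that share a high fraction of word n-grams with the fine-tuning pool. The n-gram size, the threshold, the function names, and the example texts are all illustrative assumptions; this is not the authors' decontamination procedure, which the abstract does not describe.

```python
import re

def ngrams(text, n=5):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlaps(train_texts, eval_texts, n=5, threshold=0.5):
    """Return indices of eval items whose n-gram overlap with the
    training pool is at least `threshold` (fraction of the item's n-grams)."""
    train_pool = set()
    for text in train_texts:
        train_pool |= ngrams(text, n)
    flagged = []
    for i, text in enumerate(eval_texts):
        grams = ngrams(text, n)
        if grams and len(grams & train_pool) / len(grams) >= threshold:
            flagged.append(i)
    return flagged

# Hypothetical fine-tuning and evaluation vignettes.
train = ["A 54 year old man presents with crushing chest pain radiating to the left arm."]
evals = [
    "A 54 year old man presents with crushing chest pain radiating to the jaw.",
    "A 7 year old girl has a two day history of fever and a sandpaper rash.",
]
flagged = flag_overlaps(train, evals)
print("possibly contaminated eval items:", flagged)
```

Exact n-gram matching only catches near-verbatim overlap; paraphrased duplicates would need a fuzzier comparison, which is outside this sketch.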
minor comments (2)
- [Introduction and Models] Clarify the precise definition of 'on-device' (e.g., inference hardware constraints) and list exact parameter counts consistently across tables and text.
- [Error Analysis] Expand the error characterization section to describe how 'clinically plausible differentials' were annotated and by what criteria.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve transparency and rigor.
- Referee: [Abstract (and Methods)] The abstract provides no details on the provenance, size, train/test splits, or decontamination steps for the 'general diagnostic data' used to fine-tune Qwen3.5-35B and gpt-oss-20b. This is load-bearing for the central claim, as unverified overlap with the three evaluation tasks could explain the 87.9% result via leakage rather than genuine adaptation.
  Authors: We agree that explicit details on the fine-tuning dataset are necessary to rule out leakage and support the adaptation claim. We will revise both the abstract and Methods section to specify the provenance of the general diagnostic data, its size, the train/test splits, and the decontamination steps taken to eliminate overlap with the evaluation tasks. Revision: yes.
- Referee: [Results] Reported accuracies such as Gemma 4 31B at 86.5% and the fine-tuned Qwen3.5-35B at 87.9% lack sample sizes, variance estimates, confidence intervals, or statistical tests comparing to baselines like GPT-5-mini. This prevents assessment of whether performance orderings are stable.
  Authors: We acknowledge the value of statistical context for interpreting the reported accuracies. We will update the Results section to include the exact sample sizes for each task, add confidence intervals or variance estimates, and incorporate appropriate statistical tests (e.g., McNemar's test) for comparisons against baselines such as GPT-5-mini. Revision: yes.
- Referee: [Evaluation and Tasks] Task construction details (e.g., exact prompt formats, answer selection criteria, and how the three tasks map to real clinical workflows) are insufficient to evaluate whether the metrics (presumably exact-match or top-1) are clinically meaningful or generalizable.
  Authors: We will expand the Evaluation and Tasks section to provide the exact prompt formats, clarify the answer selection criteria (top-1 exact match), and explicitly map each task to corresponding real-world clinical workflows. This will better demonstrate the clinical relevance and generalizability of the metrics. Revision: yes.
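The statistical context requested in the second major comment, and promised in the response to it, could look like the following sketch: a Wilson score interval for a single accuracy and an exact two-sided McNemar test for a paired comparison of two models on the same cases. The counts are hypothetical, since the sample sizes are not reported in the abstract, and the function names are illustrative.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pairs.

    b: cases model A got right and model B got wrong; c: the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical counts for illustration only.
low, high = wilson_interval(correct=879, n=1000)
p_value = mcnemar_exact(b=30, c=15)
print(f"accuracy 87.9%, 95% CI [{low:.3f}, {high:.3f}]; McNemar p = {p_value:.3f}")
```

McNemar's test uses only the cases where the two models disagree, which is why it suits comparisons of systems evaluated on the same test set.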
Circularity Check
No circularity in empirical benchmarking study
The paper consists entirely of empirical benchmarking of on-device LLMs across three clinical tasks, comparison to proprietary models, and fine-tuning experiments on diagnostic data. No derivations, equations, fitted parameters presented as predictions, or self-citations are used as load-bearing premises. All reported accuracies (e.g., 87.9% for fine-tuned Qwen3.5-35B) are direct measurements from evaluation and do not reduce to the inputs by construction. The comparisons are anchored to external baseline models rather than to quantities the paper defines itself.