arxiv: 2601.03266 · v2 · submitted 2025-12-18 · 💻 cs.CL · cs.AI

Benchmarking and Adapting On-Device LLMs for Clinical Decision Support

Pith reviewed 2026-05-16 20:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: on-device LLMs, clinical decision support, fine-tuning, medical diagnosis, privacy-preserving AI, benchmarking, ophthalmology, diagnostic accuracy

The pith

Fine-tuned on-device LLMs reach 87.9% accuracy on clinical diagnosis, approaching GPT-5.1 at 89.4%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks on-device LLMs from the Qwen3.5, Gemma 4, and gpt-oss families on three clinical tasks: general disease diagnosis, ophthalmology diagnosis and management, and simulation of expert grading. Base models already match or exceed smaller proprietary systems like GPT-5-mini, and fine-tuning the Qwen3.5-35B model lifts general diagnostic accuracy to 87.9 percent. This matters for clinical settings because the models run locally, avoiding the privacy risks and cloud dependency of proprietary services. Error patterns show that nearly all mistakes are medically reasonable alternatives rather than irrelevant outputs. The results indicate that modest adaptation can bring open local models close enough to top closed models for practical use.

Core claim

On-device LLMs achieve performance comparable to or exceeding DeepSeek-R1 and GPT-5-mini across the three tasks despite smaller size. Fine-tuning Qwen3.5-35B on general diagnostic data produces 87.9 percent accuracy, approaching GPT-5.1 at 89.4 percent, while the base Gemma 4 31B model reaches 86.5 percent on general diagnosis. Most errors (87.2 percent) consist of clinically plausible differentials, and upper-bound analysis indicates up to 93.2 percent accuracy is attainable with improved answer selection.
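
The gap between the reported accuracy and the 93.2 percent upper bound can be illustrated with a small sketch. It assumes, hypothetically, that each case is answered several times, that the reported answer is the majority vote, and that the upper bound counts a case as solved whenever any sampled answer is correct. The paper's exact procedure is not described on this page, and the function names and toy cases below are invented for illustration.

```python
from collections import Counter

def majority_vote(samples):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]

def selection_accuracy(cases):
    """Accuracy when the majority-voted answer is the one that is scored."""
    hits = sum(majority_vote(samples) == truth for samples, truth in cases)
    return hits / len(cases)

def oracle_accuracy(cases):
    """Upper bound: a case counts if ANY sampled answer matches the truth."""
    hits = sum(truth in samples for samples, truth in cases)
    return hits / len(cases)

# Toy data (hypothetical): (sampled answers, true diagnosis)
toy_cases = [
    (["pneumonia", "pneumonia", "tuberculosis"], "pneumonia"),
    (["glioma", "meningioma", "meningioma"], "glioma"),
    (["gout", "gout", "gout"], "septic arthritis"),
    (["appendicitis", "appendicitis", "colitis"], "appendicitis"),
]

print(selection_accuracy(toy_cases))  # majority vote misses case 2 and case 3
print(oracle_accuracy(toy_cases))     # only case 3 has no correct sample
```

Under these assumptions, the difference between the two numbers is the share of cases where a correct answer was generated but not selected, which is the room that better answer selection could recover.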

What carries the argument

Benchmarking of on-device LLMs across general diagnosis, specialty ophthalmology tasks, and expert simulation, combined with targeted fine-tuning of models such as Qwen3.5-35B on diagnostic data to close the gap with proprietary systems.

If this is right

  • On-device models can supply accurate diagnostic support without uploading patient data to external servers.
  • Fine-tuning delivers large gains, bringing smaller open models within a few points of the strongest closed models.
  • The bulk of remaining errors are plausible clinical differentials rather than nonsensical predictions.
  • Local inference removes cloud latency and cost barriers for routine clinical integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Mobile or edge deployment could extend specialist-level decision support to clinics lacking reliable internet.
  • Further fine-tuning on narrower specialty datasets would likely expand coverage to additional medical domains.
  • Reduced dependence on proprietary APIs could lower costs and increase accessibility for smaller healthcare providers.

Load-bearing premise

The three chosen clinical tasks and evaluation metrics accurately reflect real-world clinical decision support needs and the fine-tuning data generalizes beyond the tested sets.
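
One way to probe this premise is to check whether fine-tuning text overlaps with evaluation cases. The sketch below is a minimal n-gram overlap check, not the authors' decontamination method (which this page does not describe); the function names, the n-gram length, and the example texts are all assumptions chosen for illustration.

```python
import re

def ngrams(text, n=8):
    """Set of word n-grams after lowercasing and stripping punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_overlaps(train_texts, eval_texts, n=8):
    """Indices of evaluation items sharing at least one n-gram with training text."""
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return [i for i, text in enumerate(eval_texts) if ngrams(text, n) & train_grams]

# Hypothetical example texts, not taken from the paper's datasets.
train = ["A 54 year old man presents with chest pain radiating to the left arm"]
evals = [
    "Case: a 54 year old man presents with chest pain and fever",
    "A child with a swollen knee after a fall from a bicycle",
]
print(flag_overlaps(train, evals, n=5))  # evaluation item 0 shares a 5-gram
```

A shorter n flags more items, including harmless boilerplate phrasing, so flagged items would still need manual review before being excluded.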

What would settle it

Deployment of the fine-tuned models in a live hospital setting where their diagnostic outputs are compared directly against board-certified physicians on new, unseen patient cases.

Figures

Figures reproduced from arXiv: 2601.03266 by Alhusain Abdalla, Alif Munim, Bo Wang, Jun Ma, Leo Chen, Omar Ibrahim, Shuolin Yin.

Figure 1. Overview of the benchmark framework. This study compares the on-device LLMs with state-of-the-art open-source and proprietary LLMs across general disease diagnosis, specialty diagnosis and treatment recommendations on ophthalmology multiple-choice questions, and judgment for open-ended clinical decision questions.

Figure 2. Zero-shot and fine-tuning performance of on-device LLMs. a, Results of LLM-as-a-generalist: diagnosis accuracy on a wide range of radiological cases (N=207). L, M, and H denote low, medium, and high reasoning efforts, respectively. b, Results of LLM-as-a-specialist: accuracy on ophthalmology cases (N=130) with diagnosis and management multiple-choice questions. c, Results of LLM-as-a-clinical-judge: violin…

Figure 1. Parameter efficiency: Fine-tuned gpt-oss-20b model outperforms the 671B DeepSeek-R1. Comparative diagnostic accuracy of the fine-tuned gpt-oss-20b model (green) versus the open-source frontier DeepSeek-R1 (gray). Despite being significantly smaller, the fine-tuned on-device model achieves a higher overall micro-average accuracy (86.5% vs 81.6%) and demonstrates superior performance in 7 out of 10 anatomica…

Figure 2. On-device versus cloud-based efficiency: Fine-tuned model surpasses o4-mini. Performance comparison between the fine-tuned gpt-oss-20b (red) and OpenAI's efficient proprietary model, o4-mini (gray). The on-device model demonstrates robust generalization, exceeding the cloud-based baseline in overall accuracy (86.5% vs 84.1%) and achieving competitive results across diverse radiological specialties. Vertica…

Figure 3. Approaching the frontier: On-device model demonstrates competitive performance with GPT-5. The forest plot illustrates the diagnostic accuracy of the fine-tuned gpt-oss-20b (blue) relative to the state-of-the-art GPT-5 (gray). While GPT-5 maintains a slight lead in overall accuracy (88.9% vs 86.5%), the confidence intervals overlap significantly across the majority of subgroups (e.g., Musculoskeletal, Abdo…

Figure 4. Heatmap of diagnostic accuracy across anatomical subgroups and model architectures. Color intensity represents accuracy (Green=High, Red=Low). Base on-device models (middle columns) exhibit significant performance degradation in specialized domains such as Cardiovascular and Breast imaging. Fine-tuning the 20b model (far right column) effectively mitigates these domain-specific weaknesses, restoring perfor…
Original abstract

Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often have large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark on-device LLMs from the gpt-oss (20b, 120b), Qwen3.5 (9B, 27B, 35B), and Gemma 4 (31B) families across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5.1, GPT-5-mini, and Gemini 3.1 Pro) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b and Qwen3.5-35B on general diagnostic data. Across tasks, on-device models achieve performance comparable to or exceeding DeepSeek-R1 and GPT-5-mini despite being substantially smaller. In addition, fine-tuning remarkably improves diagnostic accuracy, with the fine-tuned Qwen3.5-35B reaching 87.9% and approaching the proprietary GPT-5.1 (89.4%). Among base on-device models, Gemma 4 31B achieved the strongest general diagnostic accuracy at 86.5%, exceeding GPT-5-mini and approaching the fine-tuned Qwen3.5-35B. Error characterization revealed that 87.2% of diagnostic errors across all models were clinically plausible differentials rather than off-topic predictions, and upper-bound analysis showed up to 93.2% attainable accuracy through improved answer selection. These findings highlight the potential of on-device LLMs to deliver accurate, adaptable, and privacy-preserving clinical decision support, offering a practical pathway for broader integration of LLMs into routine clinical practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper benchmarks on-device LLMs (gpt-oss 20b/120b, Qwen3.5 9B/27B/35B, Gemma 4 31B) against proprietary models (GPT-5.1, GPT-5-mini, Gemini 3.1 Pro) and DeepSeek-R1 on three clinical tasks: general disease diagnosis, ophthalmology diagnosis/management, and expert grading simulation. It reports that base models like Gemma 4 31B reach 86.5% on general diagnosis, and fine-tuning Qwen3.5-35B on general diagnostic data yields 87.9% accuracy, approaching GPT-5.1 at 89.4%. Error analysis shows 87.2% of errors are clinically plausible differentials, with an upper bound of 93.2% via improved selection.

Significance. If the fine-tuning gains and task results hold after verification of data integrity, the work shows that smaller on-device models can approach proprietary performance in clinical decision support while enabling local, privacy-preserving inference. This has clear practical value for resource-limited settings and demonstrates adaptability of open models via targeted fine-tuning.

major comments (3)
  1. [Abstract (and Methods)] The abstract provides no details on the provenance, size, train/test splits, or decontamination steps for the 'general diagnostic data' used to fine-tune Qwen3.5-35B and gpt-oss-20b. This is load-bearing for the central claim, as unverified overlap with the three evaluation tasks could explain the 87.9% result via leakage rather than genuine adaptation.
  2. [Results] Reported accuracies such as Gemma 4 31B at 86.5% and the fine-tuned Qwen3.5-35B at 87.9% lack sample sizes, variance estimates, confidence intervals, or statistical tests comparing to baselines like GPT-5-mini. This prevents assessment of whether performance orderings are stable.
  3. [Evaluation and Tasks] Task construction details (e.g., exact prompt formats, answer selection criteria, and how the three tasks map to real clinical workflows) are insufficient to evaluate whether the metrics (presumably exact-match or top-1) are clinically meaningful or generalizable.
minor comments (2)
  1. [Introduction and Models] Clarify the precise definition of 'on-device' (e.g., inference hardware constraints) and list exact parameter counts consistently across tables and text.
  2. [Error Analysis] Expand the error characterization section to describe how 'clinically plausible differentials' were annotated and by what criteria.
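
The second major comment can be made concrete with a Wilson score interval for a single accuracy. The sketch assumes the general diagnosis set has N=207 cases, as stated in the Figure 2 caption, and infers the counts 182/207 and 185/207 by rounding from the reported 87.9% and 89.4%; these counts are an assumption, not figures given by the paper.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Counts inferred from the reported percentages and N=207 (assumption).
fine_tuned = wilson_interval(182, 207)  # fine-tuned Qwen3.5-35B, 87.9%
gpt_5_1 = wilson_interval(185, 207)     # GPT-5.1, 89.4%

print(f"fine-tuned Qwen3.5-35B: {fine_tuned[0]:.3f} to {fine_tuned[1]:.3f}")
print(f"GPT-5.1:                {gpt_5_1[0]:.3f} to {gpt_5_1[1]:.3f}")
```

With roughly 200 cases, each interval spans several percentage points, so a difference of 1.5 points between two models is well within the sampling noise of a single run.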

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve transparency and rigor.

  1. Referee: [Abstract (and Methods)] The abstract provides no details on the provenance, size, train/test splits, or decontamination steps for the 'general diagnostic data' used to fine-tune Qwen3.5-35B and gpt-oss-20b. This is load-bearing for the central claim, as unverified overlap with the three evaluation tasks could explain the 87.9% result via leakage rather than genuine adaptation.

    Authors: We agree that explicit details on the fine-tuning dataset are necessary to rule out leakage and support the adaptation claim. We will revise both the abstract and Methods section to specify the provenance of the general diagnostic data, its size, the train/test splits, and the decontamination steps taken to eliminate overlap with the evaluation tasks. revision: yes

  2. Referee: [Results] Reported accuracies such as Gemma 4 31B at 86.5% and the fine-tuned Qwen3.5-35B at 87.9% lack sample sizes, variance estimates, confidence intervals, or statistical tests comparing to baselines like GPT-5-mini. This prevents assessment of whether performance orderings are stable.

    Authors: We acknowledge the value of statistical context for interpreting the reported accuracies. We will update the Results section to include the exact sample sizes for each task, add confidence intervals or variance estimates, and incorporate appropriate statistical tests (e.g., McNemar's test) for comparisons against baselines such as GPT-5-mini. revision: yes

  3. Referee: [Evaluation and Tasks] Task construction details (e.g., exact prompt formats, answer selection criteria, and how the three tasks map to real clinical workflows) are insufficient to evaluate whether the metrics (presumably exact-match or top-1) are clinically meaningful or generalizable.

    Authors: We will expand the Evaluation and Tasks section to provide the exact prompt formats, clarify the answer selection criteria (top-1 exact match), and explicitly map each task to corresponding real-world clinical workflows. This will better demonstrate the clinical relevance and generalizability of the metrics. revision: yes
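
The McNemar's test mentioned in the second response compares two models on the same cases by looking only at the cases where they disagree. Below is a minimal sketch of the exact (binomial) version, assuming per-case correctness flags are available for both models; the helper names and the example flags are hypothetical.

```python
from math import comb

def discordant_counts(model_a_correct, model_b_correct):
    """Count cases where exactly one of two models is correct (paired lists of bools)."""
    b = sum(a and not o for a, o in zip(model_a_correct, model_b_correct))
    c = sum(o and not a for a, o in zip(model_a_correct, model_b_correct))
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical per-case results for two models on six shared cases.
model_a = [True, True, False, True, False, True]
model_b = [True, False, False, True, True, True]
b, c = discordant_counts(model_a, model_b)
print(b, c, mcnemar_exact(b, c))
```

Because the test ignores cases where both models agree, two models with similar headline accuracy can still differ significantly if their errors fall on different cases, and vice versa.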

Circularity Check

0 steps flagged

No circularity in empirical benchmarking study

The paper consists entirely of empirical benchmarking of on-device LLMs across three clinical tasks, comparison to proprietary models, and fine-tuning experiments on diagnostic data. No derivations, equations, fitted parameters presented as predictions, or self-citations are used as load-bearing premises. All reported accuracies (e.g., 87.9% for fine-tuned Qwen3.5-35B) are direct measurements from evaluation, with no reduction to inputs by construction. The analysis is self-contained against external model benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a pure empirical benchmarking study. No mathematical derivations, new physical entities, or ad-hoc axioms are introduced; all claims rest on standard machine-learning evaluation practices.
