Benchmarking and Adapting On-Device LLMs for Clinical Decision Support

Alhusain Abdalla; Alif Munim; Bo Wang; Jun Ma; Leo Chen; Omar Ibrahim; Shuolin Yin

arxiv: 2601.03266 · v2 · submitted 2025-12-18 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-16 20:53 UTC · model grok-4.3

keywords on-device LLMs · clinical decision support · fine-tuning · medical diagnosis · privacy-preserving AI · benchmarking · ophthalmology · diagnostic accuracy

The pith

Fine-tuned on-device LLMs reach 87.9% accuracy on clinical diagnosis, approaching GPT-5.1 at 89.4%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks on-device LLMs from the Qwen3.5, Gemma 4, and gpt-oss families on three clinical tasks: general disease diagnosis, ophthalmology diagnosis and management, and simulation of expert grading. Base models already match or exceed smaller proprietary systems like GPT-5-mini, and fine-tuning the Qwen3.5-35B model lifts general diagnostic accuracy to 87.9 percent. This matters for clinical settings because the models run locally, avoiding the privacy risks and cloud dependency of proprietary services. Error patterns show that nearly all mistakes are medically reasonable alternatives rather than irrelevant outputs. The results indicate that modest adaptation can bring open local models close enough to top closed models for practical use.

Core claim

On-device LLMs achieve performance comparable to or exceeding DeepSeek-R1 and GPT-5-mini across the three tasks despite smaller size. Fine-tuning Qwen3.5-35B on general diagnostic data produces 87.9 percent accuracy, approaching GPT-5.1 at 89.4 percent, while the base Gemma 4 31B model reaches 86.5 percent on general diagnosis. Most errors (87.2 percent) consist of clinically plausible differentials, and upper-bound analysis indicates up to 93.2 percent accuracy is attainable with improved answer selection.
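
One way to read the upper-bound figure is as an oracle over several sampled answers per case: a case counts as solved if any sample names the correct diagnosis, whereas the reported accuracy depends on which sample is finally selected. The sketch below illustrates that gap with majority voting as the selection rule. The function names and the toy cases are hypothetical, and the paper's exact upper-bound procedure is not described on this page, so this is an assumption about the general technique rather than the authors' method.

```python
from collections import Counter

def majority_vote(samples):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]

def selection_accuracies(cases):
    """cases: list of (sampled_answers, gold_answer) pairs.

    Returns (majority_vote_accuracy, oracle_upper_bound), both in [0, 1].
    """
    voted = sum(majority_vote(samples) == gold for samples, gold in cases)
    oracle = sum(gold in samples for samples, gold in cases)
    return voted / len(cases), oracle / len(cases)

# Toy data, invented for illustration only (not from the paper).
toy_cases = [
    (["pneumonia", "pneumonia", "tuberculosis"], "pneumonia"),
    (["glioma", "meningioma", "meningioma"], "glioma"),
    (["appendicitis", "appendicitis", "appendicitis"], "appendicitis"),
    (["sarcoidosis", "lymphoma", "lymphoma"], "tuberculosis"),
]
voted_acc, oracle_acc = selection_accuracies(toy_cases)
```

In the toy data the second case contains the correct answer among its samples but loses the vote, which is the kind of error that better answer selection could recover.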

What carries the argument

Benchmarking of on-device LLMs across general diagnosis, specialty ophthalmology tasks, and expert simulation, combined with targeted fine-tuning of models such as Qwen3.5-35B on diagnostic data to close the gap with proprietary systems.

If this is right

On-device models can supply accurate diagnostic support without uploading patient data to external servers.
Fine-tuning delivers large gains, bringing smaller open models within a few points of the strongest closed models.
The bulk of remaining errors are plausible clinical differentials rather than nonsensical predictions.
Local inference removes cloud latency and cost barriers for routine clinical integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Mobile or edge deployment could extend specialist-level decision support to clinics lacking reliable internet.
Further fine-tuning on narrower specialty datasets would likely expand coverage to additional medical domains.
Reduced dependence on proprietary APIs could lower costs and increase accessibility for smaller healthcare providers.

Load-bearing premise

The three chosen clinical tasks and evaluation metrics accurately reflect real-world clinical decision support needs and the fine-tuning data generalizes beyond the tested sets.

What would settle it

Deployment of the fine-tuned models in a live hospital setting where their diagnostic outputs are compared directly against board-certified physicians on new, unseen patient cases.

Figures

Figures reproduced from arXiv: 2601.03266 by Alhusain Abdalla, Alif Munim, Bo Wang, Jun Ma, Leo Chen, Omar Ibrahim, Shuolin Yin.

**Figure 1.** Overview of the benchmark framework. This study compares the on-device LLMs with state-of-the-art open-source and proprietary LLMs across general disease diagnosis, specialty diagnosis and treatment recommendations on ophthalmology multiple-choice questions, and judgment for open-ended clinical decision questions.

**Figure 2.** Zero-shot and fine-tuning performance of on-device LLMs. a, Results of LLM-as-a-generalist: diagnosis accuracy on a wide range of radiological cases (N=207). L, M, and H denote low, medium, and high reasoning efforts, respectively. b, Results of LLM-as-a-specialist: accuracy on ophthalmology cases (N=130) with diagnosis and management multiple-choice questions. c, Results of LLM-as-a-clinical-judge: violin…

**Figure 1.** Parameter efficiency: Fine-tuned gpt-oss-20b model outperforms the 671B DeepSeek-R1. Comparative diagnostic accuracy of the fine-tuned gpt-oss-20b model (green) versus the open-source frontier DeepSeek-R1 (gray). Despite being significantly smaller, the fine-tuned on-device model achieves a higher overall micro-average accuracy (86.5% vs 81.6%) and demonstrates superior performance in 7 out of 10 anatomica…

**Figure 2.** On-device versus cloud-based efficiency: Fine-tuned model surpasses o4-mini. Performance comparison between the fine-tuned gpt-oss-20b (red) and OpenAI's efficient proprietary model, o4-mini (gray). The on-device model demonstrates robust generalization, exceeding the cloud-based baseline in overall accuracy (86.5% vs 84.1%) and achieving competitive results across diverse radiological specialties. Vertica…

**Figure 3.** Approaching the frontier: On-device model demonstrates competitive performance with GPT-5. The forest plot illustrates the diagnostic accuracy of the fine-tuned gpt-oss-20b (blue) relative to the state-of-the-art GPT-5 (gray). While GPT-5 maintains a slight lead in overall accuracy (88.9% vs 86.5%), the confidence intervals overlap significantly across the majority of subgroups (e.g., Musculoskeletal, Abdo…

**Figure 4.** Heatmap of diagnostic accuracy across anatomical subgroups and model architectures. Color intensity represents accuracy (Green=High, Red=Low). Base on-device models (middle columns) exhibit significant performance degradation in specialized domains such as Cardiovascular and Breast imaging. Fine-tuning the 20b model (far right column) effectively mitigates these domain-specific weaknesses, restoring perfor…

Original abstract

Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often have large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark on-device LLMs from the gpt-oss (20b, 120b), Qwen3.5 (9B, 27B, 35B), and Gemma 4 (31B) families across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5.1, GPT-5-mini, and Gemini 3.1 Pro) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b and Qwen3.5-35B on general diagnostic data. Across tasks, on-device models achieve performance comparable to or exceeding DeepSeek-R1 and GPT-5-mini despite being substantially smaller. In addition, fine-tuning remarkably improves diagnostic accuracy, with the fine-tuned Qwen3.5-35B reaching 87.9% and approaching the proprietary GPT-5.1 (89.4%). Among base on-device models, Gemma 4 31B achieved the strongest general diagnostic accuracy at 86.5%, exceeding GPT-5-mini and approaching the fine-tuned Qwen3.5-35B. Error characterization revealed that 87.2% of diagnostic errors across all models were clinically plausible differentials rather than off-topic predictions, and upper-bound analysis showed up to 93.2% attainable accuracy through improved answer selection. These findings highlight the potential of on-device LLMs to deliver accurate, adaptable, and privacy-preserving clinical decision support, offering a practical pathway for broader integration of LLMs into routine clinical practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives concrete benchmark numbers for several on-device model families on clinical diagnosis tasks and shows fine-tuning lifts performance close to proprietary baselines, but the adaptation claims rest on thin data documentation.

The main thing to know is that this work supplies head-to-head accuracy figures for gpt-oss, Qwen3.5, and Gemma 4 models on three clinical tasks, with fine-tuned Qwen3.5-35B reaching 87.9 percent and Gemma 4 31B base at 86.5 percent, both competitive with GPT-5.1 at 89.4 percent and GPT-5-mini. They also report that most errors across models are clinically plausible differentials and that an upper bound sits at 93.2 percent with better answer selection. That error breakdown and the upper-bound calculation are the parts that feel most useful right now.

The privacy and on-device angle is straightforward and relevant for settings where cloud calls are not an option. The comparisons to DeepSeek-R1 and the proprietary models are direct enough to give a sense of relative standing. The fine-tuning experiments on gpt-oss-20b and Qwen3.5-35B using general diagnostic data are the clearest addition to the existing literature on these specific families.

The soft spots sit mainly in the methods. The abstract and the reported results give almost no information on where the fine-tuning data came from, how large it was, what the train-test split looked like, or whether any decontamination was done against the evaluation sets. Without those details, the 1.5-point gap to GPT-5.1 could reflect leakage or prompt effects rather than real transfer. No sample sizes, confidence intervals, or statistical tests appear for the main accuracy numbers, so it is hard to judge whether the model ordering is stable. The three tasks are reasonable proxies, but the paper does not show how well they map onto actual clinical decision workflows or how the grading simulation was validated.

This is the kind of paper that belongs in a reading group focused on applied clinical NLP or on-device deployment. Readers who need current numbers for smaller models will get value from the tables even if they treat the fine-tuning claims cautiously. It deserves a serious referee because the empirical scope is practical and the topic is timely, though the review will almost certainly require expanded data sections and basic statistical reporting before acceptance.
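
The leakage concern in the letter can be made concrete with a simple overlap screen between fine-tuning text and evaluation cases. The sketch below flags evaluation items that share any word n-gram with the training set. It is a minimal illustration, not the paper's procedure: the function names, the whitespace tokenization, the default n-gram length, and the example texts are all assumptions chosen for this sketch.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_overlaps(train_texts, eval_texts, n=8):
    """Indices of eval items sharing at least one n-gram with any training item."""
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return [i for i, text in enumerate(eval_texts) if ngrams(text, n) & train_grams]

# Toy texts, invented for illustration only.
train = ["A 45 year old man presents with chest pain and dyspnea"]
evals = [
    "Case: a 45 year old man presents with fever",
    "A child presents with a limp",
]
flagged = flag_overlaps(train, evals, n=4)
```

A screen like this only catches near-verbatim reuse. Paraphrased or translated cases would pass it, so a clean result narrows the leakage question without closing it.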

Referee Report

3 major / 2 minor

Summary. The paper benchmarks on-device LLMs (gpt-oss 20b/120b, Qwen3.5 9B/27B/35B, Gemma 4 31B) against proprietary models (GPT-5.1, GPT-5-mini, Gemini 3.1 Pro) and DeepSeek-R1 on three clinical tasks: general disease diagnosis, ophthalmology diagnosis/management, and expert grading simulation. It reports that base models like Gemma 4 31B reach 86.5% on general diagnosis, and fine-tuning Qwen3.5-35B on general diagnostic data yields 87.9% accuracy, approaching GPT-5.1 at 89.4%. Error analysis shows 87.2% of errors are clinically plausible differentials, with an upper bound of 93.2% via improved selection.

Significance. If the fine-tuning gains and task results hold after verification of data integrity, the work shows that smaller on-device models can approach proprietary performance in clinical decision support while enabling local, privacy-preserving inference. This has clear practical value for resource-limited settings and demonstrates adaptability of open models via targeted fine-tuning.

major comments (3)

[Abstract (and Methods)] The abstract provides no details on the provenance, size, train/test splits, or decontamination steps for the 'general diagnostic data' used to fine-tune Qwen3.5-35B and gpt-oss-20b. This is load-bearing for the central claim, as unverified overlap with the three evaluation tasks could explain the 87.9% result via leakage rather than genuine adaptation.
[Results] Reported accuracies such as Gemma 4 31B at 86.5% and the fine-tuned Qwen3.5-35B at 87.9% lack sample sizes, variance estimates, confidence intervals, or statistical tests comparing to baselines like GPT-5-mini. This prevents assessment of whether performance orderings are stable.
[Evaluation and Tasks] Task construction details (e.g., exact prompt formats, answer selection criteria, and how the three tasks map to real clinical workflows) are insufficient to evaluate whether the metrics (presumably exact-match or top-1) are clinically meaningful or generalizable.

minor comments (2)

[Introduction and Models] Clarify the precise definition of 'on-device' (e.g., inference hardware constraints) and list exact parameter counts consistently across tables and text.
[Error Analysis] Expand the error characterization section to describe how 'clinically plausible differentials' were annotated and by what criteria.
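
To illustrate the second major comment, the sketch below computes a Wilson score interval for a single accuracy figure. The count of 182 correct out of 207 is an assumption: 207 is the case count given in the Figure 2 caption for the general diagnosis task, and 182/207 rounds to the reported 87.9 percent, but neither the count nor the applicability of that N to the current version is confirmed on this page.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(correct, total, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Assumed counts (see the note above); not taken from the paper's tables.
low, high = wilson_interval(182, 207)
```

At a sample size of this order the interval spans several percentage points on each side, which is why gaps of one or two points between models are hard to rank without paired testing.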

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve transparency and rigor.

Referee: [Abstract (and Methods)] The abstract provides no details on the provenance, size, train/test splits, or decontamination steps for the 'general diagnostic data' used to fine-tune Qwen3.5-35B and gpt-oss-20b. This is load-bearing for the central claim, as unverified overlap with the three evaluation tasks could explain the 87.9% result via leakage rather than genuine adaptation.

Authors: We agree that explicit details on the fine-tuning dataset are necessary to rule out leakage and support the adaptation claim. We will revise both the abstract and Methods section to specify the provenance of the general diagnostic data, its size, the train/test splits, and the decontamination steps taken to eliminate overlap with the evaluation tasks. revision: yes

Referee: [Results] Reported accuracies such as Gemma 4 31B at 86.5% and the fine-tuned Qwen3.5-35B at 87.9% lack sample sizes, variance estimates, confidence intervals, or statistical tests comparing to baselines like GPT-5-mini. This prevents assessment of whether performance orderings are stable.

Authors: We acknowledge the value of statistical context for interpreting the reported accuracies. We will update the Results section to include the exact sample sizes for each task, add confidence intervals or variance estimates, and incorporate appropriate statistical tests (e.g., McNemar's test) for comparisons against baselines such as GPT-5-mini. revision: yes

Referee: [Evaluation and Tasks] Task construction details (e.g., exact prompt formats, answer selection criteria, and how the three tasks map to real clinical workflows) are insufficient to evaluate whether the metrics (presumably exact-match or top-1) are clinically meaningful or generalizable.

Authors: We will expand the Evaluation and Tasks section to provide the exact prompt formats, clarify the answer selection criteria (top-1 exact match), and explicitly map each task to corresponding real-world clinical workflows. This will better demonstrate the clinical relevance and generalizability of the metrics. revision: yes
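
McNemar's test, named in the second response, compares two models on the same cases using only the discordant pairs: cases one model got right and the other got wrong. The sketch below implements the exact two-sided version with the standard library. The function name and the discordant counts are hypothetical, since per-case paired outcomes are not reported on this page.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases model A got right and model B got wrong.
    c: cases model B got right and model A got wrong.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases, so no evidence of a difference
    # Binomial tail under the null that each discordant case is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, invented for illustration.
p_value = mcnemar_exact(9, 6)
```

Cases that both models answer correctly, or both answer incorrectly, do not enter the statistic, so the test needs the per-case outputs rather than the two headline accuracies.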

Circularity Check

0 steps flagged

No circularity in empirical benchmarking study

The paper consists entirely of empirical benchmarking of on-device LLMs across three clinical tasks, comparison to proprietary models, and fine-tuning experiments on diagnostic data. No derivations, equations, fitted parameters presented as predictions, or self-citations are used as load-bearing premises. All reported accuracies (e.g., 87.9% for fine-tuned Qwen3.5-35B) are direct measurements from evaluation, with no reduction to inputs by construction. The analysis is self-contained against external model benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a pure empirical benchmarking study. No mathematical derivations, new physical entities, or ad-hoc axioms are introduced; all claims rest on standard machine-learning evaluation practices.

pith-pipeline@v0.9.0 · 5692 in / 1003 out tokens · 50759 ms · 2026-05-16T20:53:32.300126+00:00 · methodology
