pith. sign in

arxiv: 2605.16215 · v1 · pith:RGJBW7T4new · submitted 2026-05-15 · 💻 cs.AI · cs.CL

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Pith reviewed 2026-05-20 18:48 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords clinical LLMsfully open modelsauditable AI pipelinesmedical benchmarkssynthetic clinical dataclinician validationLLM evaluation
0
0 comments X

The pith

Fully open pipelines for clinical LLMs achieve state-of-the-art performance while exposing every step for audit and reproduction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first complete open pipeline for training clinical large language models. It starts with public medical question-answering datasets, adds synthetic questions and vignettes reviewed by clinicians, removes contaminated examples, and tests the resulting models with a judge protocol calibrated to human doctors. A reader would care because most medical AI tools hide their training data and processes, which makes independent verification difficult in high-stakes settings. This work shows that transparency does not have to come at the cost of capability.

Core claim

We introduce Fully Open Meditron as an end-to-end auditable pipeline that normalizes eight public medical QA datasets into conversational format, augments them with clinician-vetted synthetic extensions from 46,469 clinical practice guidelines and vignettes, applies system-wide decontamination and gold-label resampling, and validates outputs with a four-physician panel and an LLM-as-a-judge protocol calibrated against 204 human raters. Applying this recipe to open base models yields variants that are preferred over their bases and, in some cases, over existing closed medical models on benchmarks and vignette comparisons.

What carries the argument

The Fully Open Meditron pipeline, which combines data unification, clinician auditing of synthetic extensions, decontamination, and use-aligned evaluation to produce reproducible clinical LLMs.

If this is right

  • Open-weight models gain substantial medical capability when trained on this audited corpus.
  • The pipeline works across different base model sizes and families.
  • Evaluation can rely on calibrated LLM judges rather than always needing full human review.
  • Clinical decision support systems can be built with complete data provenance and reproducibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar pipelines might apply to other high-stakes domains where transparency matters as much as accuracy.
  • Reducing dependence on proprietary data could accelerate development of specialized models in medicine.
  • Real-world deployment tests would reveal whether benchmark gains translate to better patient outcomes.

Load-bearing premise

The clinician-created synthetic questions and vignettes accurately represent real clinical situations without adding systematic errors or biases that affect model decisions.

What would settle it

Running the open models on a held-out set of actual anonymized patient records and finding higher rates of incorrect or unsafe advice compared to closed models.

Figures

Figures reproduced from arXiv: 2605.16215 by David Sasu, Fay Elhassan, Lars Klein, Mary-Anne Hartley, Mushtaha El-Amin, Sahaj Vaidya, Victor Cartier-Negadi, Xavier Theimer-Lienhard.

Figure 1
Figure 1. Figure 1: Evolution of medical LLM performance on Healthbench over time across closed-data, open [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The Fully Open Meditron Corpus construction pipeline. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of Fully Open Meditron datasets in records count. 5 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Auto-MOOVE pairwise preference results. For each prompt drawn from the MOOVE evaluation split, two model responses are evaluated by Qwen3-235B-A22B which assigns a winner (Model 1, Model 2, or Tie). Bars show the share of prompts on which each model wins, ties, or loses (N = 12,602 comparisons per pair). Judge agreement with a 204-rater human panel was validated prior to use; see App. H. (Left: Each Fully … view at source ↗
Figure 5
Figure 5. Figure 5: Per-criterion Auto-MOOVE Likert profiles for Fully Open Meditron models versus corre [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Synthetic MOOVE vs. source (nsrc = 24,679, nsyn = 24,465). Top specialties preserved in rank; difficulty shifts toward levels 4–5. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Guidelines QA vs. source (nsrc = 16,300, nsyn = 145,681, a ∼9× amplification). Difficulty is not comparable for this component, since the source consists of clinical practice guidelines rather than question– answer pairs. Both annotated axes closely match the source (JSD ≤ 0.014). Unspecified Infectious disease Neurology Gastroenterology Endocrinology Pediatrics Obstetrics General medicine Cardiology Ophth… view at source ↗
Figure 8
Figure 8. Figure 8: Synthetic Curated QA vs. source (nsrc = 211,244, nsyn = 214,654). The generator broadens coverage from the eight aggregated source datasets, promoting under-represented specialties; difficulty shift is 2.81 → 3.55. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Distribution of per-rater κ values across the 204-rater human panel, with the Auto-MOOVE judge’s κ situated within it. The judge falls within ±2σ of the human mean under both with-ties and no-ties scoring, indicating it is statistically indistinguishable from a typical human rater on this validation set. I Training details I.1 Infrastructure and framework All Fully Open Meditron models were trained on a hi… view at source ↗
Figure 10
Figure 10. Figure 10: Medical LLM Openness Tiers 30 [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗
read the original abstract

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Fully Open Meditron as the first fully open pipeline for clinical LLMs. It comprises a clinician-audited corpus that unifies eight public medical QA datasets into conversational format and augments them with three synthetic extensions (exam-style QA, guideline-grounded QA from 46,469 clinical practice guidelines, and clinical vignettes), a reproducible data-construction and training framework with system-wide decontamination and gold-label resampling, and a use-aligned evaluation protocol using LLM-as-a-judge on expert-written vignettes calibrated against 204 human raters. The recipe is applied to five fully open base models; reported results include a +6.6-point aggregate-benchmark gain for Apertus-70B-MeditronFO and a 58.6% preference rate for Gemma-3-27B-MeditronFO over MedGemma (with 58% vs 55.9% on HealthBench). The central claim is that fully open pipelines can reach domain-specific state-of-the-art performance while preserving auditability and reproducibility.

Significance. If the performance gains are attributable to genuine capability rather than training-distribution artifacts, the work is significant for establishing a concrete, end-to-end auditable recipe that closes the gap between open-weight and fully open models in medicine. Strengths include the explicit clinician auditing by a four-physician panel, the unification of public datasets with guideline-derived synthetic data, and the emphasis on decontamination and reproducible evaluation. These elements directly address the opacity problem in current LLM-based clinical decision support and provide a template that other groups can replicate or extend.

major comments (2)
  1. [Data Construction] Data Construction section: The central claim that the pipeline achieves genuine domain-specific gains rests on the assumption that the clinician-vetted synthetic extensions (especially the 46,469 guideline-grounded QA items and vignettes) faithfully represent real-world clinical distributions. The manuscript describes four-physician auditing and decontamination but provides no quantitative comparison (e.g., Kolmogorov-Smirnov tests or comorbidity-frequency tables) of the generated data against real clinical query logs or EHR statistics. Without such validation, the reported +6.6-point benchmark improvement and 58.6% preference rate could partly reflect distributional overlap with the evaluation vignettes rather than improved generalization.
  2. [Evaluation Protocol] Evaluation Protocol section: The LLM-as-a-judge protocol is calibrated on 204 human raters and applied to expert-written clinical vignettes, yet the training corpus contains similar vignette-style and guideline-derived synthetic data. The manuscript does not report a hold-out test on prospective clinical outcomes or external real-world logs, leaving open the possibility that the 58% vs 55.9% HealthBench result and overall preference scores are inflated by shared generative processes. A concrete external validation set would be required to support the claim that the gains reflect true capability rather than evaluation calibration.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'system-wide decontamination' is used without a brief parenthetical description of the exact procedure or exclusion criteria; adding one sentence would improve immediate clarity for readers.
  2. [Methods] Notation: The manuscript refers to 'FO base models' and 'MeditronFO variants' without an explicit glossary or table defining the five base models and their corresponding fine-tuned names; a small nomenclature table would reduce ambiguity.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, providing honest responses based on the scope and constraints of our work. Where feasible, we have revised the manuscript to incorporate clarifications and additional discussion.

read point-by-point responses
  1. Referee: [Data Construction] The manuscript describes four-physician auditing and decontamination but provides no quantitative comparison (e.g., Kolmogorov-Smirnov tests or comorbidity-frequency tables) of the generated data against real clinical query logs or EHR statistics. Without such validation, the reported +6.6-point benchmark improvement and 58.6% preference rate could partly reflect distributional overlap with the evaluation vignettes rather than improved generalization.

    Authors: We agree that quantitative distributional comparisons to real-world clinical logs would provide stronger evidence of representativeness. However, such logs and EHR statistics are not publicly available due to privacy regulations, preventing direct statistical tests like Kolmogorov-Smirnov or comorbidity tables from external sources. Our validation instead centers on systematic review by a four-physician panel, as described in the manuscript, combined with system-wide decontamination. We have added a dedicated limitations paragraph in the revised Data Construction section that discusses the representativeness of guideline-derived data, reports coverage statistics from the 46,469 guidelines, and notes the distinction between training distributions and evaluation benchmarks. Performance gains on multiple held-out medical benchmarks support generalization beyond any potential overlap. revision: partial

  2. Referee: [Evaluation Protocol] The LLM-as-a-judge protocol is calibrated on 204 human raters and applied to expert-written clinical vignettes, yet the training corpus contains similar vignette-style and guideline-derived synthetic data. The manuscript does not report a hold-out test on prospective clinical outcomes or external real-world logs, leaving open the possibility that the 58% vs 55.9% HealthBench result and overall preference scores are inflated by shared generative processes. A concrete external validation set would be required to support the claim that the gains reflect true capability rather than evaluation calibration.

    Authors: We appreciate the concern regarding potential calibration effects. HealthBench is an independent, externally developed benchmark that was not generated by our pipeline or synthetic processes. The LLM-as-a-judge protocol was explicitly calibrated against 204 human raters to align with expert clinical judgment, and we report results on both this protocol and standard aggregate medical benchmarks. We have revised the Evaluation Protocol section to more explicitly delineate the scope of our claims, clarify that results reflect benchmark performance rather than live clinical deployment, and state that prospective outcome validation lies outside the current computational study. We maintain that the combination of decontamination, distinct evaluation vignettes, and human calibration supports the reported gains as reflecting improved capability. revision: yes

standing simulated objections not resolved
  • Quantitative distributional analysis against real clinical query logs or EHR data, as these are not publicly accessible due to privacy regulations.
  • Prospective validation on real-world clinical outcomes, which would require IRB approval, live deployment, and access to patient data beyond the scope of this paper.

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external benchmarks

full rationale

The paper's central claims rest on measured performance improvements (+6.6 points on aggregate medical benchmarks, 58% vs 55.9% on HealthBench) obtained by applying a data-construction pipeline to five base models and evaluating the resulting models against independent public QA datasets and a human-calibrated LLM-as-a-judge protocol. No derivation, equation, or first-principles step reduces to its own fitted inputs or self-citations; the synthetic extensions are generated from external guidelines, decontaminated, and then tested on separate vignettes and benchmarks whose distributions are not defined by the pipeline itself. The evaluation protocol is calibrated on 204 external human raters rather than on the training data, rendering the reported preference rates falsifiable outside the construction process.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on domain assumptions about data quality rather than new mathematical constructs or fitted parameters.

axioms (2)
  • domain assumption Clinician audits and vetting of synthetic data from guidelines produce high-fidelity, unbiased training examples representative of clinical practice.
    Invoked in the description of the three clinician-vetted synthetic extensions and the four-physician panel validation.
  • domain assumption LLM-as-a-judge scores calibrated on 204 human raters provide a reliable proxy for clinical quality on expert-written vignettes.
    Central to the use-aligned evaluation protocol.

pith-pipeline@v0.9.0 · 5922 in / 1473 out tokens · 52539 ms · 2026-05-20T18:48:44.861286+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 11 internal anchors

  1. [1]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

  2. [2]

    MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

    Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. Meditron-70b: Scaling medical pretraining for large language models.arXiv preprint arXiv:2311.16079, 2023

  3. [3]

    Biomistral: A collection of open-source pretrained large language models for medical domains

    Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. InFindings of the association for computational linguistics: acl 2024, pages 5848–5864, 2024

  4. [4]

    Medical large language models are vulnerable to data-poisoning attacks.Nature Medicine, 31(2):618–626, 2025

    Daniel Alexander Alber, Zihao Yang, Anton Alyakin, Eunice Yang, Sumedha Rai, Aly A Valliani, Jeff Zhang, Gabriel R Rosenbaum, Ashley K Amend-Thomas, David B Kurland, et al. Medical large language models are vulnerable to data-poisoning attacks.Nature Medicine, 31(2):618–626, 2025. 10

  5. [5]

    Training large language models on narrow tasks can lead to broad misalignment.Nature, 649(8097):584–589, 2026

    Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Martín Soto, Megha Srivastava, Nathan Labenz, and Owain Evans. Training large language models on narrow tasks can lead to broad misalignment.Nature, 649(8097):584–589, 2026

  6. [6]

    Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks.Journal of the American Medical Informatics Association, 32(6):1015–1024, 2025

    Felix J Dorfner, Amin Dada, Felix Busch, Marcus R Makowski, Tianyu Han, Daniel Truhn, Jens Kleesiek, Madhumita Sushil, Lisa C Adams, and Keno K Bressem. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks.Journal of the American Medical Informatics Association, 32(6):1015–1024, 2025

  7. [7]

    Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023

  8. [8]

    Toward expert-level medical question answering with large language models.Nature medicine, 31(3):943–950, 2025

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature medicine, 31(3):943–950, 2025

  9. [9]

    Capabilities of Gemini Models in Medicine

    Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

  10. [10]

    Huatuogpt-ii, one-stage training for medical adaption of llms.arXiv preprint arXiv:2311.09774, 2023

    Junying Chen, Xidong Wang, Ke Ji, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, et al. Huatuogpt-ii, one-stage training for medical adaption of llms.arXiv preprint arXiv:2311.09774, 2023

  11. [11]

    Pmc- llama: toward building open-source language models for medicine.Journal of the American Medical Informatics Association, 31(9):1833–1843, 2024

    Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Weidi Xie, and Yanfeng Wang. Pmc- llama: toward building open-source language models for medicine.Journal of the American Medical Informatics Association, 31(9):1833–1843, 2024

  12. [12]

    Llama-3-meditron: An open-weight suite of medical llms based on llama-3.1

    Alexandre Sallinen, Antoni-Joan Solergibert, Michael Zhang, Guillaume Boyé, Maud Dupont- Roc, Xavier Theimer-Lienhard, Etienne Boisson, Bastien Bernath, Hichem Hadhri, Antoine Tran, et al. Llama-3-meditron: An open-weight suite of medical llms based on llama-3.1. In Workshop on Large Language Models and Generative AI for Health at AAAI 2025, 2025

  13. [13]

    Investigating data contamination in modern benchmarks for large language models

    Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8706–8719, 2024

  14. [14]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

  15. [15]

    Truthfulqa: Measuring how models mimic hu- man falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic hu- man falsehoods. InProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022

  16. [16]

    Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

  17. [17]

    Winogrande: An adversarial Winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial Winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

  18. [18]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  19. [19]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2381–2391, 2018. 11

  20. [20]

    Time travel in llms: Tracing data contamination in large language models.arXiv preprint arXiv:2308.08493, 2023

    Shahriar Golchin and Mihai Surdeanu. Time travel in llms: Tracing data contamination in large language models.arXiv preprint arXiv:2308.08493, 2023

  21. [21]

    Apertus: Democratizing open and compliant llms for global language environments.arXiv preprint arXiv:2509.14233,

    Project Apertus, Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Ange- lika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Ed- uard Frank ˇDurech, et al. Apertus: Democratizing open and compliant llms for global language environments.arXiv preprint arXiv:2509.14233, 2025

  22. [22]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

  23. [23]

    Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023

  24. [24]

    Liveclin: A live clinical benchmark without leakage.arXiv preprint arXiv:2602.16747, 2026

    Xidong Wang, Shuqi Guo, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, and Benyou Wang. Liveclin: A live clinical benchmark without leakage.arXiv preprint arXiv:2602.16747, 2026

  25. [25]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. InAdvances in Neural Information Processing Systems, volume 36. Neural Information Processing Systems Foundation, 2023

  26. [26]

    Judging the judges: Evaluating alignment and vulnerabilities in LLMs- as-judges

    Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulnerabilities in LLMs- as-judges. InProceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM), pages 404–430, Vienna, Austria, July 2025. Association for Computational Linguistics

  27. [27]

    Judge’s verdict: A comprehensive analysis of llm judge capability through human agreement.arXiv preprint arXiv:2510.09738, 2025

    Steve Han, Gilberto Titericz Junior, Tom Balough, and Wenfei Zhou. Judge’s verdict: A comprehensive analysis of llm judge capability through human agreement.arXiv preprint arXiv:2510.09738, 2025

  28. [28]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

  29. [29]

    Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering, 2022

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering, 2022

  30. [30]

    Pubmedqa: A dataset for biomedical research question answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2567–2577, 2019

  31. [31]

    Medexpqa: Multilingual benchmarking of large language models for medical question answering.Artificial intelligence in medicine, 155:102938, 2024

    Iñigo Alonso, Maite Oronoz, and Rodrigo Agerri. Medexpqa: Multilingual benchmarking of large language models for medical question answering.Artificial intelligence in medicine, 155:102938, 2024

  32. [32]

    Overview of the medical question answering task at trec 2017 liveqa

    Asma Ben Abacha, Eugene Agichtein, Yuval Pinter, and Dina Demner-Fushman. Overview of the medical question answering task at trec 2017 liveqa. InTREC, pages 1–12, 2017

  33. [33]

    Afrimed-qa: A pan-african, multi-specialty, medical question-answering benchmark dataset

    Tobi Olatunji, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, Folafunmi Omofoye, Foutse Yuehgoh, Timothy Faniran, et al. Afrimed-qa: a pan-african, multi-specialty, medical question-answering benchmark dataset.arXiv preprint arXiv:2411.15640, 2024

  34. [34]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025. 12

  35. [35]

    2 OLMo 2 Furious

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious.arXiv preprint arXiv:2501.00656, 2024

  36. [36]

    Eurollm-22b: Technical report.arXiv preprint arXiv:2602.05879, 2026

    Miguel Moura Ramos, Duarte M Alves, Hippolyte Gisserot-Boukhlef, João Alves, Pedro Hen- rique Martins, Patrick Fernandes, José Pombal, Nuno M Guerreiro, Ricardo Rei, Nicolas Boizard, et al. Eurollm-22b: Technical report.arXiv preprint arXiv:2602.05879, 2026

  37. [37]

    Gemma 3 technical report, 2025

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report, 2025

  38. [38]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025

  39. [39]

    MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

    Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding.arXiv preprint arXiv:2501.18362, 2025

  40. [40]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  41. [41]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023

  42. [42]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  43. [43]

    Medical foundation large language models for comprehensive text analysis and beyond.npj Digital Medicine, 8:141, 2025

    Qianqian Xie, Qingyu Chen, Aokun Chen, Cheng Peng, Yan Hu, Fongci Lin, Xueqing Peng, Jimin Huang, Jeffrey Zhang, Vipina Keloth, Xinyu Zhou, Lingfei Qian, Huan He, Dennis Shung, Lucila Ohno-Machado, Yonghui Wu, Hua Xu, and Jiang Bian. Medical foundation large language models for comprehensive text analysis and beyond.npj Digital Medicine, 8:141, 2025

  44. [44]

    Tulu 3: Pushing frontiers in open language model post-training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. InProceedings of the International Conference on Learning Representations, 2025

  45. [45]

    Matt White, Ibrahim Haddad, Cailean Osborne, Xiao-Yang Yanglet Liu, Ahmed Abdelmonsef, Sachin Varghese, and Arnaud Le Hors. The model openness framework: Promoting complete- ness and openness for reproducibility, transparency, and usability in artificial intelligence.arXiv preprint arXiv:2403.13784, 2024. 13 A Examples where Gemma-Meditron wins against Me...

  46. [46]

    It is fast, widely available, and provides detailed anatomical information about solid organs, vasculature, retroperitoneum, and bony structures

    Imaging Modality: Contrast-Enhanced CT Scan *Why CT?CT is the gold standard for evaluating hemodynamically unstable trauma patients with suspected abdominal or thoracic injury. It is fast, widely available, and provides detailed anatomical information about solid organs, vasculature, retroperitoneum, and bony structures. *Why Contrast Enhanced?While a non...

  47. [47]

    The protocol should be streamlined and focused

    Acquisition Protocol: Trauma CT Protocol (Optimized for Hemodynamic Instability) Given the patient’s hemodynamic instability, speed and efficiency are paramount. The protocol should be streamlined and focused. *Scanner:Use a high-speed, multi-detector CT scanner (64-slice or higher preferred). *Patient Positioning:Supine position with arms above the head....

  48. [48]

    *Splenic Injury:The spleen is the most commonly injured solid organ in blunt abdominal trauma

    Most Likely Diagnosis: Given the mechanism (bicycle accident), hemodynamic instability, and severe upper abdominal pain, the most likely diagnoses involve significantintra-abdominal hemorrhage. *Splenic Injury:The spleen is the most commonly injured solid organ in blunt abdominal trauma. Rupture or laceration can lead to massive hemoperitoneum and shock. ...

  49. [49]

    The patient’s GCS of 15 allows for a quick, whole-body CT scan

    Imaging Modality: Contrast-Enhanced Multidetector CT (MDCT) of the Abdomen and Pelvis – FAST-CT Protocol Why CT, not US?While bedside ultrasound (FAST exam) isalwaysfirst-line in the ED for the "ABCDE" assessment and to detect free fluid, in a patient who is already being taken for definitive imaging, and who hasspecific* abdominal pain, a CT offers a far...

  50. [50]

    Model 1” and “Model 2

    Acquisition Protocol (FAST-CT – Adapted for CHUV resources) | Parameter | Recommended Setting (CHUV-typical) | Rationale | |———–|————————————- |———–| |Scanner| 128- or 256-slice (or higher) | Faster rotation, thinner slices | |Gantry Tilt| 0-15° (depending on patient habitus) | Improves visualization of the upper abdomen | |Inspiration Breath-Hold | 10-15...