Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework
Pith reviewed 2026-05-10 16:04 UTC · model grok-4.3
The pith
Small open LLMs generate unique answers to the same medical question in 87-97 percent of cases even at low temperature.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that even low-temperature generation (T=0.2) produces at most 0.20 self-agreement across ten runs per question, with 87-97 percent of all outputs per model being unique. The evaluation framework computes eight quality metrics (BERTScore, ROUGE-L, an LLM-as-judge rubric, and others) together with two within-model reproducibility metrics derived from the repeated inferences. Applied to Llama 3.1 8B, Gemma 3 12B, and MedGemma 1.5 4B on the fifty MedQuAD questions, the clinically fine-tuned MedGemma underperforms the larger general-purpose models on both quality and reproducibility, though the comparison confounds domain adaptation with model scale. The methodology is described in enough detail for practitioners to replicate or extend it.
What carries the argument
The open-source evaluation pipeline that generates N=10 low-temperature responses per question and extracts two within-model reproducibility metrics from the set of outputs, used in parallel with lexical, semantic, and LLM-judge quality metrics.
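The paper reports the two reproducibility numbers without spelling out their formulas in this review; a minimal sketch of one plausible reading — pairwise exact-match self-agreement and the fraction of distinct outputs among the N runs for a question — looks like this. The normalization step and the exact-match criterion are assumptions here; the released pipeline may use a semantic rather than lexical notion of agreement.

```python
from itertools import combinations

def self_agreement(responses):
    """Fraction of response pairs that match across the N runs for one question.

    Assumption: exact string match after whitespace/case normalization;
    the paper's actual pairwise measure lives in the released pipeline.
    """
    norm = [" ".join(r.split()).lower() for r in responses]
    pairs = list(combinations(norm, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

def uniqueness(responses):
    """Share of distinct outputs among the N runs for one question."""
    norm = [" ".join(r.split()).lower() for r in responses]
    return len(set(norm)) / len(norm)

# Three runs on one question: one repeated answer, one divergent answer.
runs = ["Take ibuprofen.", "Take ibuprofen.", "Use acetaminophen."]
print(self_agreement(runs))  # 1 matching pair out of 3 -> ~0.33
print(uniqueness(runs))      # 2 distinct outputs of 3 -> ~0.67
```

Under this reading, the paper's headline numbers mean that across ten runs almost every pair of answers differs (agreement ≤ 0.20) and nearly every string is distinct (87-97 percent uniqueness).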
If this is right
- Model selection for medical use must include repeated-run reproducibility checks rather than single-pass accuracy alone.
- The open pipeline lets practitioners compare any small open LLM on both correctness and stability for their own deployment needs.
- Clinically fine-tuned models cannot be assumed to improve consistency without isolating scale from domain adaptation.
- Low-temperature sampling by itself does not deliver the output stability required for reliable medical assistants.
Where Pith is reading between the lines
- If the low agreement persists across other question sets, then additional techniques such as prompt engineering or output ensembling would be needed before safe deployment in health forums.
- Future tests that match model sizes while varying only the fine-tuning data could separate the effects of clinical adaptation on reproducibility.
- Applying the same repeated-run protocol to actual user questions from online communities rather than curated datasets could expose different consistency behavior.
Load-bearing premise
The fifty MedQuAD questions, together with ten runs each and the chosen lexical, semantic, and judge metrics, are sufficient to represent the consistency needs of medical question answering in real online health communities.
What would settle it
Repeating the evaluation on a substantially larger or more diverse collection of medical questions and obtaining self-agreement above 0.5 or uniqueness below 70 percent would indicate that the reported gap does not hold under broader conditions.
read the original abstract
Incorporating large language models (LLMs) in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness. We present a practical, open-source evaluation framework for assessing small, locally-deployable open-weight LLMs on medical question answering, treating reproducibility as a first-class metric alongside lexical and semantic accuracy. Our pipeline computes eight quality metrics, including BERTScore, ROUGE-L, and an LLM-as-judge rubric, together with two within-model reproducibility metrics derived from repeated inference (N=10 runs per question). Evaluating three models (Llama 3.1 8B, Gemma 3 12B, MedGemma 1.5 4B) on 50 MedQuAD questions (N=1,500 total responses) reveals that despite low-temperature generation (T=0.2), self-agreement across runs reaches at most 0.20, while 87-97% of all outputs per model are unique -- a safety gap that single-pass benchmarks entirely miss. The clinically fine-tuned MedGemma 1.5 4B underperforms the larger general-purpose models on both quality and reproducibility; however, because MedGemma is also the smallest model, this comparison confounds domain fine-tuning with model scale. We describe the methodology in sufficient detail for practitioners to replicate or extend the evaluation for their own model-selection workflows. All code and data pipelines are available at https://github.com/aviad-buskila/llm_medical_reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a practical open-source evaluation framework for small open-weight LLMs on medical question answering, treating reproducibility as a first-class metric. It evaluates Llama 3.1 8B, Gemma 3 12B, and MedGemma 1.5 4B on 50 MedQuAD questions (N=10 repeated inferences each at T=0.2, totaling 1,500 responses) using eight metrics including BERTScore, ROUGE-L, and an LLM-as-judge rubric, plus two within-model reproducibility metrics. Key findings are that self-agreement reaches at most 0.20 and 87-97% of outputs per model are unique, exposing a safety gap missed by single-pass benchmarks; MedGemma underperforms but the comparison confounds fine-tuning with scale. All code and pipelines are publicly released.
Significance. If the low-reproducibility findings hold under the reported conditions, the work is significant for identifying consistency failures in LLM medical QA that standard accuracy benchmarks overlook, with direct relevance to misinformation risks in online health communities. The explicit provision of reproducible code, detailed methodology, and parameter-free empirical metrics is a clear strength that supports practitioner adoption and extension. Significance is reduced by the narrow empirical base, but the framework itself offers a useful template for future model-selection workflows.
major comments (3)
- [Results] Results section: The central claim that self-agreement reaches 'at most 0.20' and 87-97% of outputs are unique is reported only in aggregate without per-question variance, bootstrap confidence intervals, or statistical tests on the agreement distributions. This directly affects the precision and robustness of the safety-gap argument given N=10 runs.
- [Methodology] Methodology section: The selection of the 50 MedQuAD questions is not accompanied by justification, diversity analysis, or sensitivity checks against real-world online health queries (e.g., ambiguous, personal, or symptom-based questions). Because the framework is positioned for high-stakes medical use, this selection effect is load-bearing for generalizing the reproducibility gap beyond the tested set.
- [Evaluation Metrics] Evaluation Metrics section: The exact prompt template, scoring rubric, and decision rules for the LLM-as-judge are not fully specified in the text. This hinders verification of the quality metrics and is load-bearing for claims that combine lexical, semantic, and judge-based scores.
minor comments (2)
- [Abstract] Abstract: The note that MedGemma underperforms while confounding fine-tuning with scale could be moved earlier or phrased more prominently to prevent readers from over-interpreting the domain-adaptation result.
- [Introduction] Introduction: A brief comparison table or citation to prior LLM consistency studies in healthcare would better situate the novelty of treating reproducibility as a first-class metric.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to strengthen the statistical presentation of results, provide better justification for the question selection, and fully document the LLM-as-judge methodology. Our point-by-point responses follow.
read point-by-point responses
-
Referee: [Results] Results section: The central claim that self-agreement reaches 'at most 0.20' and 87-97% of outputs are unique is reported only in aggregate without per-question variance, bootstrap confidence intervals, or statistical tests on the agreement distributions. This directly affects the precision and robustness of the safety-gap argument given N=10 runs.
Authors: We agree that aggregate-only reporting limits the ability to assess robustness. In the revised manuscript we now include a supplementary table with per-question self-agreement and uniqueness values, bootstrap confidence intervals (1,000 resamples) around the reported means, and a short description of the agreement distribution across the 50 questions. These additions show that the upper bound of 0.20 is not driven by a few atypical questions and that the low-reproducibility finding holds consistently, thereby reinforcing rather than weakening the safety-gap argument. revision: yes
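The percentile bootstrap the authors describe (1,000 resamples over the 50 per-question scores) can be sketched as follows; the resample count and confidence level come from the rebuttal, while the percentile method and the illustrative scores are assumptions.

```python
import random

def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-question scores.

    Sketch of the 1,000-resample procedure mentioned in the rebuttal;
    the authors' exact method may differ (e.g. BCa intervals).
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-question self-agreement values clustered near the
# paper's reported low range (50 questions).
scores = [0.0, 0.1, 0.2, 0.1, 0.0] * 10
lo, hi = bootstrap_ci(scores)
print(lo, hi)
```

A narrow interval well below 0.5 would support the authors' claim that the 0.20 upper bound is not an artifact of a few atypical questions.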
-
Referee: [Methodology] Methodology section: The selection of the 50 MedQuAD questions is not accompanied by justification, diversity analysis, or sensitivity checks against real-world online health queries (e.g., ambiguous, personal, or symptom-based questions). Because the framework is positioned for high-stakes medical use, this selection effect is load-bearing for generalizing the reproducibility gap beyond the tested set.
Authors: The 50 questions were drawn uniformly at random from MedQuAD to obtain a manageable yet diverse sample. We have added an explicit justification paragraph and a diversity table (topic distribution derived from MedQuAD category labels) to the Methodology section. We also performed a limited sensitivity check on a 10-question subset containing more open-ended phrasing; reproducibility metrics remained in the same low range. Full validation against live forum queries would require new data collection outside the current scope and is now listed as a limitation and future-work item. revision: partial
-
Referee: [Evaluation Metrics] Evaluation Metrics section: The exact prompt template, scoring rubric, and decision rules for the LLM-as-judge are not fully specified in the text. This hinders verification of the quality metrics and is load-bearing for claims that combine lexical, semantic, and judge-based scores.
Authors: The complete prompt templates, the five-point rubric used by the judge (covering factual accuracy, relevance, and internal consistency), and the exact decision rules for aggregating judge scores are already present in the public GitHub repository. To eliminate any ambiguity in the paper itself, we have inserted the full templates and rubric into a new Appendix A, together with an example of judge output and the aggregation procedure. revision: yes
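The rebuttal names the three rubric dimensions (factual accuracy, relevance, internal consistency) on a five-point scale but not the aggregation rule, which is deferred to Appendix A and the repository. A mean-of-means aggregation is one hypothetical stand-in:

```python
def aggregate_judge_scores(scores_per_run):
    """Collapse per-run, per-dimension 1-5 judge scores into one quality score.

    The three dimension names come from the rebuttal; averaging within
    each run and then across runs is an assumed rule, not the paper's
    documented procedure.
    """
    per_run = [sum(s.values()) / len(s) for s in scores_per_run]
    return sum(per_run) / len(per_run)

runs = [
    {"factual_accuracy": 4, "relevance": 5, "internal_consistency": 4},
    {"factual_accuracy": 3, "relevance": 4, "internal_consistency": 4},
]
print(aggregate_judge_scores(runs))  # -> 4.0
```

Whatever the actual rule, publishing it in the paper (as the revision promises) is what lets readers check that judge-based and lexical scores are being combined consistently.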
Circularity Check
No circularity: direct empirical computation from model outputs
full rationale
This is a pure benchmarking study that runs three LLMs on 50 fixed MedQuAD questions for N=10 repetitions each, then computes eight standard metrics (BERTScore, ROUGE-L, LLM-as-judge, plus two reproducibility statistics) directly from the 1,500 generated strings. No equations, no fitted parameters renamed as predictions, no uniqueness theorems, and no self-citation chains appear in the derivation. All reported figures (self-agreement ≤0.20, 87-97% unique outputs) are literal aggregates of the observed outputs against external references; the methodology is fully specified for replication without reference to any prior result by the same authors.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Medical question answering requires output consistency, in addition to average accuracy, for reliability as a tool.
Reference graph
Works this paper leans on
- [1] A. Ben Abacha and D. Demner-Fushman. A question entailment approach to question answering. BMC Bioinformatics, 20(1):511, 2019.
- [2] A. Dubey et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [3] D. Jin et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
- [4] Q. Jin et al. PubMedQA: A biomedical research question answering dataset. In Proc. EMNLP, 2019.
- [5] A. E. W. Johnson et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1, 2023.
- [6] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (ACL Workshop), 2004.
- [7] Google DeepMind / medaibase contributors. MedGemma: Clinically fine-tuned Gemma models. https://huggingface.co/medaibase, 2024.
- [8] Ollama Contributors. Ollama: Run large language models locally. https://ollama.com, 2024.
- [9] K. Papineni et al. BLEU: A method for automatic evaluation of machine translation. In Proc. ACL, 2002.
- [10] K. Singhal et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- [11] Gemma Team, Google DeepMind. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
- [12] T. Zhang et al. BERTScore: Evaluating text generation with BERT. In Proc. ICLR, 2020.
- [13] L. Zheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS, 2023.
- [14] X. Wang et al. Self-consistency improves chain of thought reasoning in language models. In Proc. ICLR, 2023.
- [15] X. Chen et al. Universal self-consistency for large language model generation. arXiv preprint arXiv:2311.17311, 2023.
- [16] S. Kadavath et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- [17] H. Nori et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452, 2023.
- [18] V. Suarez-Lledo and J. Alvarez-Galvez. Prevalence of health misinformation on social media: Systematic review. Journal of Medical Internet Research, 23(1):e17187, 2021.
- [19] M. A. Sager et al. Identifying and responding to health misinformation on Reddit dermatology forums with artificially intelligent bots using natural language processing: Design and evaluation study. JMIR Dermatology, 4(2):e20975, 2021.