Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework
Pith reviewed 2026-05-10 16:04 UTC · model grok-4.3
The pith
Small open LLMs generate unique answers to the same medical question in 87-97 percent of cases even at low temperature.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that even low-temperature generation (T=0.2) produces at most 0.20 self-agreement across ten runs per question, with 87-97 percent of all outputs per model being unique. The evaluation framework computes eight quality metrics (BERTScore, ROUGE-L, an LLM-as-judge rubric, and others) together with two within-model reproducibility metrics derived from the repeated inferences. Applied to Llama 3.1 8B, Gemma 3 12B, and MedGemma 1.5 4B on the fifty MedQuAD questions, the clinically fine-tuned MedGemma underperforms the larger general-purpose models on both quality and reproducibility, though the comparison confounds domain adaptation with model scale. The methodology is described in enough detail for practitioners to replicate or extend it.
What carries the argument
The open-source evaluation pipeline that generates N=10 low-temperature responses per question and extracts two within-model reproducibility metrics from the set of outputs, used in parallel with lexical, semantic, and LLM-judge quality metrics.
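The paper reports the two reproducibility numbers without spelling out their formulas in this review; a minimal sketch of one plausible reading — pairwise exact-match self-agreement and the fraction of distinct outputs among the N runs for a question — looks like this. The normalization step and the exact-match criterion are assumptions here; the released pipeline may use a semantic rather than lexical notion of agreement.

```python
from itertools import combinations

def self_agreement(responses):
    """Fraction of response pairs that match across the N runs for one question.

    Assumption: exact string match after whitespace/case normalization;
    the paper's actual pairwise measure lives in the released pipeline.
    """
    norm = [" ".join(r.split()).lower() for r in responses]
    pairs = list(combinations(norm, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

def uniqueness(responses):
    """Share of distinct outputs among the N runs for one question."""
    norm = [" ".join(r.split()).lower() for r in responses]
    return len(set(norm)) / len(norm)

# Three runs on one question: one repeated answer, one divergent answer.
runs = ["Take ibuprofen.", "Take ibuprofen.", "Use acetaminophen."]
print(self_agreement(runs))  # 1 matching pair out of 3 -> ~0.33
print(uniqueness(runs))      # 2 distinct outputs of 3 -> ~0.67
```

Under this reading, the paper's headline numbers mean that across ten runs almost every pair of answers differs (agreement ≤ 0.20) and nearly every string is distinct (87-97 percent uniqueness).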
If this is right
- Model selection for medical use must include repeated-run reproducibility checks rather than single-pass accuracy alone.
- The open pipeline lets practitioners compare any small open LLM on both correctness and stability for their own deployment needs.
- Clinically fine-tuned models cannot be assumed to improve consistency without isolating scale from domain adaptation.
- Low-temperature sampling by itself does not deliver the output stability required for reliable medical assistants.
Where Pith is reading between the lines
- If the low agreement persists across other question sets, then additional techniques such as prompt engineering or output ensembling would be needed before safe deployment in health forums.
- Future tests that match model sizes while varying only the fine-tuning data could separate the effects of clinical adaptation on reproducibility.
- Applying the same repeated-run protocol to actual user questions from online communities rather than curated datasets could expose different consistency behavior.
Load-bearing premise
The fifty MedQuAD questions, together with ten runs each and the chosen lexical, semantic, and judge metrics, are sufficient to represent the consistency needs of medical question answering in real online health communities.
What would settle it
Repeating the evaluation on a substantially larger or more diverse collection of medical questions and obtaining self-agreement above 0.5 or uniqueness below 70 percent would indicate that the reported gap does not hold under broader conditions.
read the original abstract
Incorporating large language models (LLMs) in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness. We present a practical, open-source evaluation framework for assessing small, locally-deployable open-weight LLMs on medical question answering, treating reproducibility as a first-class metric alongside lexical and semantic accuracy. Our pipeline computes eight quality metrics, including BERTScore, ROUGE-L, and an LLM-as-judge rubric, together with two within-model reproducibility metrics derived from repeated inference (N=10 runs per question). Evaluating three models (Llama 3.1 8B, Gemma 3 12B, MedGemma 1.5 4B) on 50 MedQuAD questions (N=1,500 total responses) reveals that despite low-temperature generation (T=0.2), self-agreement across runs reaches at most 0.20, while 87-97% of all outputs per model are unique -- a safety gap that single-pass benchmarks entirely miss. The clinically fine-tuned MedGemma 1.5 4B underperforms the larger general-purpose models on both quality and reproducibility; however, because MedGemma is also the smallest model, this comparison confounds domain fine-tuning with model scale. We describe the methodology in sufficient detail for practitioners to replicate or extend the evaluation for their own model-selection workflows. All code and data pipelines are available at https://github.com/aviad-buskila/llm_medical_reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a practical open-source evaluation framework for small open-weight LLMs on medical question answering, treating reproducibility as a first-class metric. It evaluates Llama 3.1 8B, Gemma 3 12B, and MedGemma 1.5 4B on 50 MedQuAD questions (N=10 repeated inferences each at T=0.2, totaling 1,500 responses) using eight metrics including BERTScore, ROUGE-L, and an LLM-as-judge rubric, plus two within-model reproducibility metrics. Key findings are that self-agreement reaches at most 0.20 and 87-97% of outputs per model are unique, exposing a safety gap missed by single-pass benchmarks; MedGemma underperforms but the comparison confounds fine-tuning with scale. All code and pipelines are publicly released.
Significance. If the low-reproducibility findings hold under the reported conditions, the work is significant for identifying consistency failures in LLM medical QA that standard accuracy benchmarks overlook, with direct relevance to misinformation risks in online health communities. The explicit provision of reproducible code, detailed methodology, and parameter-free empirical metrics is a clear strength that supports practitioner adoption and extension. Significance is reduced by the narrow empirical base, but the framework itself offers a useful template for future model-selection workflows.
major comments (3)
- [Results] Results section: The central claim that self-agreement reaches 'at most 0.20' and 87-97% of outputs are unique is reported only in aggregate without per-question variance, bootstrap confidence intervals, or statistical tests on the agreement distributions. This directly affects the precision and robustness of the safety-gap argument given N=10 runs.
- [Methodology] Methodology section: The selection of the 50 MedQuAD questions is not accompanied by justification, diversity analysis, or sensitivity checks against real-world online health queries (e.g., ambiguous, personal, or symptom-based questions). Because the framework is positioned for high-stakes medical use, this selection effect is load-bearing for generalizing the reproducibility gap beyond the tested set.
- [Evaluation Metrics] Evaluation Metrics section: The exact prompt template, scoring rubric, and decision rules for the LLM-as-judge are not fully specified in the text. This hinders verification of the quality metrics and is load-bearing for claims that combine lexical, semantic, and judge-based scores.
minor comments (2)
- [Abstract] Abstract: The note that MedGemma underperforms while confounding fine-tuning with scale could be moved earlier or phrased more prominently to prevent readers from over-interpreting the domain-adaptation result.
- [Introduction] Introduction: A brief comparison table or citation to prior LLM consistency studies in healthcare would better situate the novelty of treating reproducibility as a first-class metric.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to strengthen the statistical presentation of results, provide better justification for the question selection, and fully document the LLM-as-judge methodology. Our point-by-point responses follow.
read point-by-point responses
-
Referee: [Results] Results section: The central claim that self-agreement reaches 'at most 0.20' and 87-97% of outputs are unique is reported only in aggregate without per-question variance, bootstrap confidence intervals, or statistical tests on the agreement distributions. This directly affects the precision and robustness of the safety-gap argument given N=10 runs.
Authors: We agree that aggregate-only reporting limits the ability to assess robustness. In the revised manuscript we now include a supplementary table with per-question self-agreement and uniqueness values, bootstrap confidence intervals (1,000 resamples) around the reported means, and a short description of the agreement distribution across the 50 questions. These additions show that the upper bound of 0.20 is not driven by a few atypical questions and that the low-reproducibility finding holds consistently, thereby reinforcing rather than weakening the safety-gap argument. revision: yes
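The percentile bootstrap the authors describe (1,000 resamples over the 50 per-question scores) can be sketched as follows; the resample count and confidence level come from the rebuttal, while the percentile method and the illustrative scores are assumptions.

```python
import random

def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-question scores.

    Sketch of the 1,000-resample procedure mentioned in the rebuttal;
    the authors' exact method may differ (e.g. BCa intervals).
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-question self-agreement values clustered near the
# paper's reported low range (50 questions).
scores = [0.0, 0.1, 0.2, 0.1, 0.0] * 10
lo, hi = bootstrap_ci(scores)
print(lo, hi)
```

A narrow interval well below 0.5 would support the authors' claim that the 0.20 upper bound is not an artifact of a few atypical questions.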
-
Referee: [Methodology] Methodology section: The selection of the 50 MedQuAD questions is not accompanied by justification, diversity analysis, or sensitivity checks against real-world online health queries (e.g., ambiguous, personal, or symptom-based questions). Because the framework is positioned for high-stakes medical use, this selection effect is load-bearing for generalizing the reproducibility gap beyond the tested set.
Authors: The 50 questions were drawn uniformly at random from MedQuAD to obtain a manageable yet diverse sample. We have added an explicit justification paragraph and a diversity table (topic distribution derived from MedQuAD category labels) to the Methodology section. We also performed a limited sensitivity check on a 10-question subset containing more open-ended phrasing; reproducibility metrics remained in the same low range. Full validation against live forum queries would require new data collection outside the current scope and is now listed as a limitation and future-work item. revision: partial
-
Referee: [Evaluation Metrics] Evaluation Metrics section: The exact prompt template, scoring rubric, and decision rules for the LLM-as-judge are not fully specified in the text. This hinders verification of the quality metrics and is load-bearing for claims that combine lexical, semantic, and judge-based scores.
Authors: The complete prompt templates, the five-point rubric used by the judge (covering factual accuracy, relevance, and internal consistency), and the exact decision rules for aggregating judge scores are already present in the public GitHub repository. To eliminate any ambiguity in the paper itself, we have inserted the full templates and rubric into a new Appendix A, together with an example of judge output and the aggregation procedure. revision: yes
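The rebuttal names the three rubric dimensions (factual accuracy, relevance, internal consistency) on a five-point scale but not the aggregation rule, which is deferred to Appendix A and the repository. A mean-of-means aggregation is one hypothetical stand-in:

```python
def aggregate_judge_scores(scores_per_run):
    """Collapse per-run, per-dimension 1-5 judge scores into one quality score.

    The three dimension names come from the rebuttal; averaging within
    each run and then across runs is an assumed rule, not the paper's
    documented procedure.
    """
    per_run = [sum(s.values()) / len(s) for s in scores_per_run]
    return sum(per_run) / len(per_run)

runs = [
    {"factual_accuracy": 4, "relevance": 5, "internal_consistency": 4},
    {"factual_accuracy": 3, "relevance": 4, "internal_consistency": 4},
]
print(aggregate_judge_scores(runs))  # -> 4.0
```

Whatever the actual rule, publishing it in the paper (as the revision promises) is what lets readers check that judge-based and lexical scores are being combined consistently.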
Circularity Check
No circularity: direct empirical computation from model outputs
full rationale
This is a pure benchmarking study that runs three LLMs on 50 fixed MedQuAD questions for N=10 repetitions each, then computes eight standard metrics (BERTScore, ROUGE-L, LLM-as-judge, plus two reproducibility statistics) directly from the 1,500 generated strings. No equations, no fitted parameters renamed as predictions, no uniqueness theorems, and no self-citation chains appear in the derivation. All reported figures (self-agreement ≤0.20, 87-97% unique outputs) are literal aggregates of the observed outputs against external references; the methodology is fully specified for replication without reference to any prior result by the same authors.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Medical question answering requires output consistency, in addition to average accuracy, for reliability as a tool.
Reference graph
Works this paper leans on
- [1] A. Ben Abacha and D. Demner-Fushman. A question entailment approach to question answering. BMC Bioinformatics, 20(1):511, 2019.
- [2] A. Dubey et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [3] D. Jin et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
- [4] Q. Jin et al. PubMedQA: A biomedical research question answering dataset. In Proc. EMNLP, 2019.
- [5] A. E. W. Johnson et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1, 2023.
- [6] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (ACL Workshop), 2004.
- [7] Google DeepMind / medaibase contributors. MedGemma: Clinically fine-tuned Gemma models. https://huggingface.co/medaibase, 2024.
- [8] Ollama Contributors. Ollama: Run large language models locally. https://ollama.com, 2024.
- [9] K. Papineni et al. BLEU: A method for automatic evaluation of machine translation. In Proc. ACL, 2002.
- [10] K. Singhal et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- [11] Gemma Team, Google DeepMind. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
- [12] T. Zhang et al. BERTScore: Evaluating text generation with BERT. In Proc. ICLR, 2020.
- [13] L. Zheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS, 2023.
- [14] X. Wang et al. Self-consistency improves chain of thought reasoning in language models. In Proc. ICLR, 2023.
- [15] X. Chen et al. Universal self-consistency for large language model generation. arXiv preprint arXiv:2311.17311, 2023.
- [16] S. Kadavath et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- [17] H. Nori et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452, 2023.
- [18] V. Suarez-Lledo and J. Alvarez-Galvez. Prevalence of health misinformation on social media: Systematic review. Journal of Medical Internet Research, 23(1):e17187, 2021.
- [19] M. A. Sager et al. Identifying and responding to health misinformation on Reddit dermatology forums with artificially intelligent bots using natural language processing: Design and evaluation study. JMIR Dermatology, 4(2):e20975, 2021.