pith. machine review for the scientific record.

arxiv: 2604.10535 · v1 · submitted 2026-04-12 · 💻 cs.IR · cs.CL

Recognition: unknown

Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:04 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords: medical question answering · LLM reproducibility · output consistency · small language models · evaluation framework · MedQuAD · BERTScore · ROUGE-L

The pith

Small open LLMs generate unique answers to the same medical question in 87-97 percent of cases even at low temperature.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that medical question answering with LLMs requires treating output consistency as a core requirement alongside accuracy, because inconsistent responses can spread conflicting information in online health communities. It supplies a complete open-source pipeline that runs each question ten times, then scores the outputs with eight quality measures such as BERTScore and ROUGE-L plus two reproducibility statistics derived from the repeated runs. When the pipeline is applied to three small models on fifty MedQuAD questions, self-agreement never exceeds 0.20 and the great majority of answers differ from one another. The clinically fine-tuned smallest model performs worse than the larger general-purpose models on both quality and consistency, although size and fine-tuning remain entangled. These results indicate that single-run benchmarks overlook a practical safety limitation for any deployment that aims to serve users seeking reliable medical information.
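The repeated-run protocol itself is easy to reproduce. Below is a minimal sketch of the generation step, assuming a locally served open model behind Ollama's REST generate endpoint (Ollama is cited by the paper for local deployment); the model tag, prompt handling, and client code here are illustrative assumptions, not the paper's own pipeline.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def generate_n_responses(model: str, question: str,
                         n: int = 10, temperature: float = 0.2) -> list[str]:
    """Ask the same model the same question n times at low temperature."""
    responses = []
    for _ in range(n):
        reply = requests.post(
            OLLAMA_URL,
            json={
                "model": model,            # e.g. "llama3.1:8b" (illustrative tag)
                "prompt": question,
                "stream": False,
                "options": {"temperature": temperature},
            },
            timeout=300,
        )
        reply.raise_for_status()
        responses.append(reply.json()["response"].strip())
    return responses

# 3 models x 50 MedQuAD questions x 10 runs gives the paper's 1,500 responses.
```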

Core claim

The central claim is that low-temperature generation (T=0.2) still produces at most 0.20 self-agreement across ten runs per question, with 87-97 percent of all outputs per model being unique. The evaluation framework computes eight quality metrics (BERTScore, ROUGE-L, an LLM-as-judge rubric, and others) together with two within-model reproducibility metrics from the repeated inferences. Applied to Llama 3.1 8B, Gemma 3 12B, and MedGemma 1.5 4B on the fifty MedQuAD questions, the clinically fine-tuned MedGemma underperforms the larger general models on both quality and reproducibility, though the comparison mixes domain adaptation with model scale. The full methodology is described in enough detail for practitioners to replicate or extend the evaluation in their own model-selection workflows.

What carries the argument

The open-source evaluation pipeline that generates N=10 low-temperature responses per question and extracts two within-model reproducibility metrics from the set of outputs, used in parallel with lexical, semantic, and LLM-judge quality metrics.
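The summary does not spell out how the two reproducibility statistics are defined, so the sketch below uses one plausible reading: self-agreement as the fraction of run pairs whose normalized answers match exactly, and uniqueness as the share of distinct normalized answers among the N runs. The paper's actual definitions may be semantic rather than exact-match.

```python
from itertools import combinations

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so case or spacing differences do not count as disagreement."""
    return " ".join(text.lower().split())

def self_agreement(runs: list[str]) -> float:
    """Fraction of run pairs whose normalized answers match exactly (assumed definition)."""
    norm = [normalize(r) for r in runs]
    pairs = list(combinations(norm, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

def uniqueness(runs: list[str]) -> float:
    """Share of distinct normalized answers among the repeated runs (assumed definition)."""
    return len(set(normalize(r) for r in runs)) / len(runs)

# If all 10 answers to a question differ, uniqueness is 1.0 and self-agreement is 0.0;
# the paper reports 87-97% uniqueness and at most 0.20 self-agreement per model.
```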

If this is right

  • Model selection for medical use must include repeated-run reproducibility checks rather than single-pass accuracy alone.
  • The open pipeline lets practitioners compare any small open LLM on both correctness and stability for their own deployment needs.
  • Clinically fine-tuned models cannot be assumed to improve consistency without isolating scale from domain adaptation.
  • Low-temperature sampling by itself does not deliver the output stability required for reliable medical assistants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the low agreement persists across other question sets, then additional techniques such as prompt engineering or output ensembling would be needed before safe deployment in health forums.
  • Future tests that match model sizes while varying only the fine-tuning data could separate the effects of clinical adaptation on reproducibility.
  • Applying the same repeated-run protocol to actual user questions from online communities rather than curated datasets could expose different consistency behavior.

Load-bearing premise

The fifty MedQuAD questions, together with ten runs each and the chosen lexical, semantic, and judge metrics, are sufficient to represent the consistency needs of medical question answering in real online health communities.

What would settle it

Repeating the evaluation on a substantially larger or more diverse collection of medical questions and obtaining self-agreement above 0.5 or uniqueness below 70 percent would indicate that the reported gap does not hold under broader conditions.

read the original abstract

Incorporating large language models (LLMs) in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness. We present a practical, open-source evaluation framework for assessing small, locally-deployable open-weight LLMs on medical question answering, treating reproducibility as a first-class metric alongside lexical and semantic accuracy. Our pipeline computes eight quality metrics, including BERTScore, ROUGE-L, and an LLM-as-judge rubric, together with two within-model reproducibility metrics derived from repeated inference (N=10 runs per question). Evaluating three models (Llama 3.1 8B, Gemma 3 12B, MedGemma 1.5 4B) on 50 MedQuAD questions (N=1,500 total responses) reveals that despite low-temperature generation (T=0.2), self-agreement across runs reaches at most 0.20, while 87-97% of all outputs per model are unique -- a safety gap that single-pass benchmarks entirely miss. The clinically fine-tuned MedGemma 1.5 4B underperforms the larger general-purpose models on both quality and reproducibility; however, because MedGemma is also the smallest model, this comparison confounds domain fine-tuning with model scale. We describe the methodology in sufficient detail for practitioners to replicate or extend the evaluation for their own model-selection workflows. All code and data pipelines are available at https://github.com/aviad-buskila/llm_medical_reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a practical open-source evaluation framework for small open-weight LLMs on medical question answering, treating reproducibility as a first-class metric. It evaluates Llama 3.1 8B, Gemma 3 12B, and MedGemma 1.5 4B on 50 MedQuAD questions (N=10 repeated inferences each at T=0.2, totaling 1,500 responses) using eight metrics including BERTScore, ROUGE-L, and an LLM-as-judge rubric, plus two within-model reproducibility metrics. Key findings are that self-agreement reaches at most 0.20 and 87-97% of outputs per model are unique, exposing a safety gap missed by single-pass benchmarks; MedGemma underperforms but the comparison confounds fine-tuning with scale. All code and pipelines are publicly released.

Significance. If the low-reproducibility findings hold under the reported conditions, the work is significant for identifying consistency failures in LLM medical QA that standard accuracy benchmarks overlook, with direct relevance to misinformation risks in online health communities. The explicit provision of reproducible code, detailed methodology, and parameter-free empirical metrics is a clear strength that supports practitioner adoption and extension. Significance is reduced by the narrow empirical base, but the framework itself offers a useful template for future model-selection workflows.

major comments (3)
  1. [Results] Results section: The central claim that self-agreement reaches 'at most 0.20' and 87-97% of outputs are unique is reported only in aggregate without per-question variance, bootstrap confidence intervals, or statistical tests on the agreement distributions. This directly affects the precision and robustness of the safety-gap argument given N=10 runs.
  2. [Methodology] Methodology section: The selection of the 50 MedQuAD questions is not accompanied by justification, diversity analysis, or sensitivity checks against real-world online health queries (e.g., ambiguous, personal, or symptom-based questions). Because the framework is positioned for high-stakes medical use, this selection effect is load-bearing for generalizing the reproducibility gap beyond the tested set.
  3. [Evaluation Metrics] Evaluation Metrics section: The exact prompt template, scoring rubric, and decision rules for the LLM-as-judge are not fully specified in the text. This hinders verification of the quality metrics and is load-bearing for claims that combine lexical, semantic, and judge-based scores.
minor comments (2)
  1. [Abstract] Abstract: The note that MedGemma underperforms while confounding fine-tuning with scale could be moved earlier or phrased more prominently to prevent readers from over-interpreting the domain-adaptation result.
  2. [Introduction] Introduction: A brief comparison table or citation to prior LLM consistency studies in healthcare would better situate the novelty of treating reproducibility as a first-class metric.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to strengthen the statistical presentation of results, provide better justification for the question selection, and fully document the LLM-as-judge methodology. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Results] Results section: The central claim that self-agreement reaches 'at most 0.20' and 87-97% of outputs are unique is reported only in aggregate without per-question variance, bootstrap confidence intervals, or statistical tests on the agreement distributions. This directly affects the precision and robustness of the safety-gap argument given N=10 runs.

    Authors: We agree that aggregate-only reporting limits the ability to assess robustness. In the revised manuscript we now include a supplementary table with per-question self-agreement and uniqueness values, bootstrap confidence intervals (1,000 resamples) around the reported means, and a short description of the agreement distribution across the 50 questions. These additions show that the upper bound of 0.20 is not driven by a few atypical questions and that the low-reproducibility finding holds consistently, thereby reinforcing rather than weakening the safety-gap argument. revision: yes

  2. Referee: [Methodology] Methodology section: The selection of the 50 MedQuAD questions is not accompanied by justification, diversity analysis, or sensitivity checks against real-world online health queries (e.g., ambiguous, personal, or symptom-based questions). Because the framework is positioned for high-stakes medical use, this selection effect is load-bearing for generalizing the reproducibility gap beyond the tested set.

    Authors: The 50 questions were drawn uniformly at random from MedQuAD to obtain a manageable yet diverse sample. We have added an explicit justification paragraph and a diversity table (topic distribution derived from MedQuAD category labels) to the Methodology section. We also performed a limited sensitivity check on a 10-question subset containing more open-ended phrasing; reproducibility metrics remained in the same low range. Full validation against live forum queries would require new data collection outside the current scope and is now listed as a limitation and future-work item. revision: partial

  3. Referee: [Evaluation Metrics] Evaluation Metrics section: The exact prompt template, scoring rubric, and decision rules for the LLM-as-judge are not fully specified in the text. This hinders verification of the quality metrics and is load-bearing for claims that combine lexical, semantic, and judge-based scores.

    Authors: The complete prompt templates, the five-point rubric used by the judge (covering factual accuracy, relevance, and internal consistency), and the exact decision rules for aggregating judge scores are already present in the public GitHub repository. To eliminate any ambiguity in the paper itself, we have inserted the full templates and rubric into a new Appendix A, together with an example of judge output and the aggregation procedure. revision: yes
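The rebuttal defers the exact judge templates to the repository; purely to illustrate the pattern it describes (a five-point rubric over factual accuracy, relevance, and internal consistency, aggregated into one score), a hedged sketch follows. The prompt wording and aggregation rule are assumptions, not the paper's Appendix A.

```python
# Hypothetical judge prompt; the paper's actual template and rubric live in its repository.
JUDGE_PROMPT = """You are grading a medical answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Score the candidate from 1 (worst) to 5 (best) on each criterion:
factual accuracy, relevance, internal consistency.
Reply with three integers separated by spaces."""

def aggregate_judge_scores(raw_reply: str) -> float:
    """Parse three rubric scores from the judge's reply and average them (assumed aggregation rule)."""
    scores = [int(token) for token in raw_reply.split()[:3]]
    return sum(scores) / len(scores)
```

For the confidence intervals described in the response to point 1 above, a percentile bootstrap over per-question scores is one standard way to produce them; the sketch below assumes that approach, since the rebuttal states only the 1,000-resample count.

```python
import random

def bootstrap_ci(values: list[float], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-question scores."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]  # resample questions with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# e.g. bootstrap_ci(per_question_self_agreement) -> interval around the mean self-agreement
```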

Circularity Check

0 steps flagged

No circularity: direct empirical computation from model outputs

full rationale

This is a pure benchmarking study that runs three LLMs on 50 fixed MedQuAD questions for N=10 repetitions each, then computes eight standard metrics (BERTScore, ROUGE-L, LLM-as-judge, plus two reproducibility statistics) directly from the 1,500 generated strings. No equations, no fitted parameters renamed as predictions, no uniqueness theorems, and no self-citation chains appear in the derivation. All reported figures (self-agreement ≤0.20, 87-97% unique outputs) are literal aggregates of the observed outputs against external references; the methodology is fully specified for replication without reference to any prior result by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters, postulates no invented entities, and relies only on standard domain assumptions about the importance of consistency in medical tools plus established NLP evaluation practices.

axioms (1)
  • domain assumption: Medical question answering requires output consistency in addition to average accuracy to serve as a reliable tool.
    Stated explicitly in the opening sentence of the abstract as the core motivation.

pith-pipeline@v0.9.0 · 5633 in / 1279 out tokens · 50846 ms · 2026-05-10T16:04:11.892545+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 5 canonical work pages · 3 internal anchors

  1. [1]

    A question entailment approach to question answering

    A. Ben Abacha and D. Demner-Fushman. A question entailment approach to question answering. BMC Bioinformatics, 20(1):511, 2019.

  2. [2]

    The Llama 3 Herd of Models

    A. Dubey et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  3. [3]

    What disease does this patient have? A large-scale open domain question answering dataset from medical exams

    D. Jin et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.

  4. [4]

    PubMedQA: A biomedical research question answering dataset

    Q. Jin et al. PubMedQA: A biomedical research question answering dataset. In Proc. EMNLP, 2019.

  5. [5]

    MIMIC-IV, a freely accessible electronic health record dataset

    A. E. W. Johnson et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1, 2023.

  6. [6]

    ROUGE: A package for automatic evaluation of summaries

    C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (ACL Workshop), 2004.

  7. [7]

    MedGemma: Clinically fine-tuned Gemma models

    Google DeepMind / medaibase contributors. MedGemma: Clinically fine-tuned Gemma models. https://huggingface.co/medaibase, 2024.

  8. [8]

    Ollama: Run large language models locally

    Ollama Contributors. Ollama: Run large language models locally. https://ollama.com, 2024.

  9. [9]

    BLEU: A method for automatic evaluation of machine translation

    K. Papineni et al. BLEU: A method for automatic evaluation of machine translation. In Proc. ACL, 2002.

  10. [10]

    Large language models encode clinical knowledge

    K. Singhal et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.

  11. [11]

    Gemma 3 Technical Report

    Gemma Team, Google DeepMind. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  12. [12]

    BERTScore: Evaluating text generation with BERT

    T. Zhang et al. BERTScore: Evaluating text generation with BERT. In Proc. ICLR, 2020.

  13. [13]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    L. Zheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS, 2023.

  14. [14]

    Self-consistency improves chain of thought reasoning in language models

    X. Wang et al. Self-consistency improves chain of thought reasoning in language models. In Proc. ICLR, 2023.

  15. [15]

    Universal self-consistency for large language model generation

    X. Chen et al. Universal self-consistency for large language model generation. arXiv preprint arXiv:2311.17311, 2023.

  16. [16]

    Language Models (Mostly) Know What They Know

    S. Kadavath et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.

  17. [17]

    Can generalist foundation models outcompete special-purpose tuning? Case study in medicine

    H. Nori et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452, 2023.

  18. [18]

    Prevalence of health misinformation on social media: Systematic review

    V. Suarez-Lledo and J. Alvarez-Galvez. Prevalence of health misinformation on social media: Systematic review. Journal of Medical Internet Research, 23(1):e17187, 2021.

  19. [19]

    Identifying and responding to health misinformation on Reddit dermatology forums with artificially intelligent bots using natural language processing: Design and evaluation study

    M. A. Sager et al. Identifying and responding to health misinformation on Reddit dermatology forums with artificially intelligent bots using natural language processing: Design and evaluation study. JMIR Dermatology, 4(2):e20975, 2021.