Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
Pith reviewed 2026-05-08 06:08 UTC · model grok-4.3
The pith
Domain fine-tuning outperforms retrieval-augmented generation for medical multiple-choice questions at the 4B-parameter scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By holding model size, prompt, decoding, retrieval, and evaluation fixed and varying only domain adaptation and RAG presence, the experiment shows that domain fine-tuning raises majority-vote accuracy from 46.4% to 53.3% on the 1,273-question MedQA-USMLE test set, a gain significant at p < 10^-4 by McNemar's test, whereas RAG yields no significant improvement.
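The headline significance figure comes from McNemar's test on paired per-question outcomes. A minimal sketch of the exact two-sided version is below; the discordant counts in the usage example are hypothetical illustrations, since the paper reports only the p-value, not the underlying pair counts:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pairs.

    b: questions model A answered correctly and model B incorrectly;
    c: the reverse. Concordant pairs do not enter the statistic.
    """
    n = b + c
    k = min(b, c)
    # Under H0, each discordant pair favors either model with probability 1/2,
    # so the smaller count follows Binomial(n, 0.5); double for two sides.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical counts: 160 vs. 73 discordant pairs over 1,273 questions
# would produce a net gain of 87 questions (+6.8 pp) and p far below 1e-4.
print(mcnemar_exact(160, 73))
```

When b and c are balanced the test correctly returns p = 1.0, so the function is easy to sanity-check before applying it to real traces.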
What carries the argument
The 2x2 controlled comparison of a general 4B model versus its domain-fine-tuned counterpart, each run with and without retrieved medical explanations from MedMCQA.
Load-bearing premise
That the chosen retrieval corpus and pipeline represent a fair implementation of RAG for this task.
What would settle it
Repeating the comparison with a stronger retrieval system such as dense vector search with reranking over a larger medical corpus and checking whether RAG then produces a statistically significant accuracy gain.
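The proposed stronger pipeline — dense retrieval followed by reranking — can be prototyped in a few lines. This is a toy sketch, not the paper's pipeline: vectors are assumed pre-normalized, and a term-overlap score stands in for a learned cross-encoder reranker.

```python
import numpy as np

def dense_retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 20) -> list:
    """First stage: top-k documents by cosine similarity.

    Assumes query_vec and each row of doc_vecs are L2-normalized,
    so the dot product equals cosine similarity.
    """
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)[:k].tolist()

def rerank(query: str, docs: list, candidates: list, k: int = 5) -> list:
    """Second stage: reorder dense candidates.

    A toy lexical term-overlap score stands in here for a
    cross-encoder reranker.
    """
    q_terms = set(query.lower().split())
    score = lambda i: len(q_terms & set(docs[i].lower().split()))
    return sorted(candidates, key=score, reverse=True)[:k]

# Illustrative corpus and embeddings (both hypothetical):
docs = [
    "first line aspirin for acute mi",
    "renal tubular acidosis type one",
    "beta blockers after myocardial infarction",
]
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
candidates = dense_retrieve(np.array([1.0, 0.0]), doc_vecs, k=2)
top = rerank("aspirin acute mi", docs, candidates, k=1)
```

Swapping this skeleton's toy components for a real embedding model and a cross-encoder over a PubMed-scale corpus is exactly the ablation that would test whether the null RAG result survives a stronger retriever.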
Original abstract
Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2x2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p < 10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a controlled 2×2 comparison of domain fine-tuning versus retrieval-augmented generation (RAG) for medical multiple-choice question answering using 4B-parameter models. Holding model size, prompt, temperature, and evaluation fixed, it evaluates Gemma-3-4B and MedGemma-4B with and without RAG over MedMCQA explanations on the full MedQA-USMLE test set (1,273 questions, 3 repetitions each). The key finding is that domain fine-tuning improves majority-vote accuracy by 6.8 percentage points (53.3% vs. 46.4%, McNemar p < 10^{-4}), while RAG yields no significant gain and a slight negative point estimate in the fine-tuned model.
Significance. If the results hold, the work supplies a clean empirical comparison showing that, at the 4B scale on MedQA-USMLE, domain adaptation via fine-tuning outperforms the tested form of in-context knowledge injection. Credit is due for the fully crossed design, three repetitions per item, McNemar testing, and public release of code plus JSONL traces. These elements make the measurements directly replicable and strengthen the practical takeaway for small-model deployment in medicine.
Major comments (2)
- [Methods (RAG corpus and pipeline)] Methods (RAG corpus and pipeline): The central claim that 'domain knowledge encoded in weights dominates domain knowledge supplied in context' depends on the RAG condition producing no benefit. The retrieval corpus is restricted to explanations from the MedMCQA dataset. Because MedMCQA is a separate exam-style collection, its explanations may have limited topical overlap, depth, or lexical match with MedQA-USMLE items. Without an ablation using a broader corpus (e.g., PubMed or textbooks) or a stronger retriever, the null/negative RAG result may reflect a weak context-injection baseline rather than a general property of weights versus context. This is load-bearing for the dominance interpretation and the practitioner recommendation.
- [Results (§4) and Discussion] Results (§4) and Discussion: The paper reports point estimates and McNemar p-values for the four conditions but does not provide per-question error analysis or breakdown by medical topic. Such an analysis would clarify whether the fine-tuning advantage is concentrated in areas where MedMCQA explanations are least relevant, directly testing the corpus-overlap concern raised above.
Minor comments (2)
- [Methods] The abstract states 'three repetitions per question (15,276 LLM calls)' but the methods should explicitly state the aggregation rule for majority vote (e.g., tie-breaking procedure) and confirm that the same seed or temperature settings were used across all cells.
- [Results] The table or figure presenting the four accuracy numbers should include both the per-run accuracies and the majority-vote accuracies so readers can assess variance across repetitions.
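The aggregation rule the first minor comment asks for can be pinned down in a few lines. This sketch assumes one plausible convention — with three repetitions over four options, the only possible tie is three distinct answers, and the earliest response wins; the paper's actual tie-break rule is precisely what the referee asks the authors to state:

```python
from collections import Counter

def majority_vote(answers: list) -> str:
    """Aggregate repeated answers to one question into a single prediction.

    Tie-break (an assumption, not from the paper): among letters tied
    for the top count, the one appearing earliest in run order wins.
    """
    counts = Counter(answers)
    top = max(counts.values())
    for a in answers:  # first-seen among the tied letters
        if counts[a] == top:
            return a

# With 3 repetitions, a 2-1 split resolves cleanly; a 1-1-1 split
# falls back to the assumed first-response tie-break.
print(majority_vote(["A", "B", "A"]))
print(majority_vote(["B", "A", "C"]))
```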
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting the strengths of our fully crossed design, statistical testing, and reproducibility measures. We address each major comment below, indicating where revisions will be made.
Point-by-point responses
-
Referee: Methods (RAG corpus and pipeline): The central claim that 'domain knowledge encoded in weights dominates domain knowledge supplied in context' depends on the RAG condition producing no benefit. The retrieval corpus is restricted to explanations from the MedMCQA dataset. Because MedMCQA is a separate exam-style collection, its explanations may have limited topical overlap, depth, or lexical match with MedQA-USMLE items. Without an ablation using a broader corpus (e.g., PubMed or textbooks) or a stronger retriever, the null/negative RAG result may reflect a weak context-injection baseline rather than a general property of weights versus context. This is load-bearing for the dominance interpretation and the practitioner recommendation.
Authors: We agree that the RAG corpus (MedMCQA explanations) is a specific choice and that a broader corpus such as PubMed abstracts or medical textbooks might produce stronger context injection. Our experiment deliberately holds the retrieval pipeline, corpus, and prompt fixed to isolate the effect of domain fine-tuning versus this form of in-context augmentation. The reported result is therefore conditional on the tested RAG configuration: at the 4B scale, fine-tuning yields a statistically significant gain while the chosen RAG does not. We do not claim that no possible RAG setup could ever close the gap. In the revised manuscript we will add an explicit limitations paragraph in the Discussion clarifying the scope of the claim and recommending that future comparisons test stronger retrievers and corpora. This preserves the value of the controlled comparison while acknowledging the referee's valid point about generalizability. revision: partial
-
Referee: Results (§4) and Discussion: The paper reports point estimates and McNemar p-values for the four conditions but does not provide per-question error analysis or breakdown by medical topic. Such an analysis would clarify whether the fine-tuning advantage is concentrated in areas where MedMCQA explanations are least relevant, directly testing the corpus-overlap concern raised above.
Authors: We accept this recommendation. The released JSONL traces contain per-question predictions across all four conditions, making such an analysis feasible. In the revision we will add a new subsection in Results that (i) reports accuracy stratified by available MedQA-USMLE subject categories and (ii) examines whether the fine-tuning advantage is larger on questions whose retrieved MedMCQA explanations exhibit low lexical or embedding overlap with the question stem. This directly tests the corpus-overlap hypothesis and will be accompanied by a brief discussion of any patterns observed. Because the traces are already public, the additional analysis can be performed without new model calls. revision: yes
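The overlap analysis promised in this response can be operationalized cheaply from the public traces. A sketch of the lexical variant using token-level Jaccard similarity between the question stem and its retrieved explanation (the embedding variant would substitute cosine similarity over encoder outputs); the 0.1 threshold is an illustrative assumption, not a value from the paper:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

def overlap_bucket(stem: str, explanation: str, threshold: float = 0.1) -> str:
    """Label a question as low- or high-overlap with its retrieved
    explanation; the 0.1 threshold is an illustrative choice."""
    return "high" if jaccard(stem, explanation) >= threshold else "low"

# Stratifying the fine-tuning gain by these buckets would directly
# test whether the advantage concentrates where retrieval is weakest.
print(overlap_bucket("aspirin acute mi",
                     "aspirin reduces mortality in acute mi"))
print(overlap_bucket("renal tubular acidosis",
                     "beta blocker dosing"))
```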
Circularity Check
No circularity: purely empirical comparison with direct measurements
full rationale
The paper conducts a controlled 2x2 experiment measuring accuracy on MedQA-USMLE under fixed conditions, varying only domain adaptation and RAG presence. All reported gains, p-values, and conclusions are computed directly from the 15,276 LLM calls and majority votes; no equations, derivations, parameter fits, or predictions are defined in terms of the outputs. No self-citations are load-bearing, and the design contains no ansatz, uniqueness theorem, or renaming of prior results. The skeptic concern addresses experimental coverage rather than logical self-reference.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: MedQA-USMLE is a valid and representative benchmark for medical multiple-choice question answering.
Reference graph
Works this paper leans on
- [1] Chroma. Chroma: the AI-native open-source embedding database. https://www.trychroma.com/, 2024.
- [2] Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms, 1998.
- [3] Gemma Team and Google DeepMind. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- [4] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
- [5] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- [6] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.
- [7] Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic Embed: Training a reproducible long context text embedder, 2024.
- [8] Ollama Contributors. Ollama: Get up and running with large language models locally. https://ollama.com, 2024.
- [9] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, pages 248–260, 2022.
- [10] Andrew Sellergren et al. MedGemma technical report. arXiv preprint arXiv:2507.05201, 2025.
- [11] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- [12] Yubo Wang, Xueguang Ma, and Wenhu Chen. Augmenting black-box LLMs with medical textbooks for biomedical question answering. arXiv preprint arXiv:2309.02233, 2024.
- [13] Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178, 2024.