Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
Pith reviewed 2026-05-08 06:08 UTC · model grok-4.3
The pith
Domain fine-tuning outperforms retrieval-augmented generation for medical multiple-choice questions at the 4B-parameter scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By holding model size, prompt, decoding, retrieval, and evaluation fixed and varying only domain adaptation and RAG presence, the experiment shows that domain fine-tuning raises majority-vote accuracy from 46.4% to 53.3% on the 1,273-question MedQA-USMLE test set, a gain significant at p < 10^-4 by McNemar's test, whereas RAG yields no significant improvement.
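The headline significance figure comes from McNemar's test on paired per-question outcomes. A minimal sketch of the exact two-sided version is below; the discordant counts in the usage example are hypothetical illustrations, since the paper reports only the p-value, not the underlying pair counts:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pairs.

    b: questions model A answered correctly and model B incorrectly;
    c: the reverse. Concordant pairs do not enter the statistic.
    """
    n = b + c
    k = min(b, c)
    # Under H0, each discordant pair favors either model with probability 1/2,
    # so the smaller count follows Binomial(n, 0.5); double for two sides.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical counts: 160 vs. 73 discordant pairs over 1,273 questions
# would produce a net gain of 87 questions (+6.8 pp) and p far below 1e-4.
print(mcnemar_exact(160, 73))
```

When b and c are balanced the test correctly returns p = 1.0, so the function is easy to sanity-check before applying it to real traces.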
What carries the argument
The 2x2 controlled comparison of a general 4B model versus its domain-fine-tuned counterpart, each run with and without retrieved medical explanations from MedMCQA.
Load-bearing premise
That the chosen retrieval corpus and pipeline represent a fair implementation of RAG for this task.
What would settle it
Repeating the comparison with a stronger retrieval system such as dense vector search with reranking over a larger medical corpus and checking whether RAG then produces a statistically significant accuracy gain.
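The proposed stronger pipeline — dense retrieval followed by reranking — can be prototyped in a few lines. This is a toy sketch, not the paper's pipeline: vectors are assumed pre-normalized, and a term-overlap score stands in for a learned cross-encoder reranker.

```python
import numpy as np

def dense_retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 20) -> list:
    """First stage: top-k documents by cosine similarity.

    Assumes query_vec and each row of doc_vecs are L2-normalized,
    so the dot product equals cosine similarity.
    """
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)[:k].tolist()

def rerank(query: str, docs: list, candidates: list, k: int = 5) -> list:
    """Second stage: reorder dense candidates.

    A toy lexical term-overlap score stands in here for a
    cross-encoder reranker.
    """
    q_terms = set(query.lower().split())
    score = lambda i: len(q_terms & set(docs[i].lower().split()))
    return sorted(candidates, key=score, reverse=True)[:k]

# Illustrative corpus and embeddings (both hypothetical):
docs = [
    "first line aspirin for acute mi",
    "renal tubular acidosis type one",
    "beta blockers after myocardial infarction",
]
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
candidates = dense_retrieve(np.array([1.0, 0.0]), doc_vecs, k=2)
top = rerank("aspirin acute mi", docs, candidates, k=1)
```

Swapping this skeleton's toy components for a real embedding model and a cross-encoder over a PubMed-scale corpus is exactly the ablation that would test whether the null RAG result survives a stronger retriever.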
Original abstract
Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2x2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p < 10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a controlled 2×2 comparison of domain fine-tuning versus retrieval-augmented generation (RAG) for medical multiple-choice question answering using 4B-parameter models. Holding model size, prompt, temperature, and evaluation fixed, it evaluates Gemma-3-4B and MedGemma-4B with and without RAG over MedMCQA explanations on the full MedQA-USMLE test set (1,273 questions, 3 repetitions each). The key finding is that domain fine-tuning improves majority-vote accuracy by 6.8 percentage points (53.3% vs. 46.4%, McNemar p < 10^{-4}), while RAG yields no significant gain and a slight negative point estimate in the fine-tuned model.
Significance. If the results hold, the work supplies a clean empirical comparison showing that, at the 4B scale on MedQA-USMLE, domain adaptation via fine-tuning outperforms the tested form of in-context knowledge injection. Credit is due for the fully crossed design, three repetitions per item, McNemar testing, and public release of code plus JSONL traces. These elements make the measurements directly replicable and strengthen the practical takeaway for small-model deployment in medicine.
Major comments (2)
- [Methods (RAG corpus and pipeline)] Methods (RAG corpus and pipeline): The central claim that 'domain knowledge encoded in weights dominates domain knowledge supplied in context' depends on the RAG condition producing no benefit. The retrieval corpus is restricted to explanations from the MedMCQA dataset. Because MedMCQA is a separate exam-style collection, its explanations may have limited topical overlap, depth, or lexical match with MedQA-USMLE items. Without an ablation using a broader corpus (e.g., PubMed or textbooks) or a stronger retriever, the null/negative RAG result may reflect a weak context-injection baseline rather than a general property of weights versus context. This is load-bearing for the dominance interpretation and the practitioner recommendation.
- [Results (§4) and Discussion] Results (§4) and Discussion: The paper reports point estimates and McNemar p-values for the four conditions but does not provide per-question error analysis or breakdown by medical topic. Such an analysis would clarify whether the fine-tuning advantage is concentrated in areas where MedMCQA explanations are least relevant, directly testing the corpus-overlap concern raised above.
Minor comments (2)
- [Methods] The abstract states 'three repetitions per question (15,276 LLM calls)' but the methods should explicitly state the aggregation rule for majority vote (e.g., tie-breaking procedure) and confirm that the same seed or temperature settings were used across all cells.
- [Results] The table or figure presenting the four accuracy numbers should include both the per-run accuracies and the majority-vote accuracies so readers can assess variance across repetitions.
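The aggregation rule the first minor comment asks for can be pinned down in a few lines. This sketch assumes one plausible convention — with three repetitions over four options, the only possible tie is three distinct answers, and the earliest response wins; the paper's actual tie-break rule is precisely what the referee asks the authors to state:

```python
from collections import Counter

def majority_vote(answers: list) -> str:
    """Aggregate repeated answers to one question into a single prediction.

    Tie-break (an assumption, not from the paper): among letters tied
    for the top count, the one appearing earliest in run order wins.
    """
    counts = Counter(answers)
    top = max(counts.values())
    for a in answers:  # first-seen among the tied letters
        if counts[a] == top:
            return a

# With 3 repetitions, a 2-1 split resolves cleanly; a 1-1-1 split
# falls back to the assumed first-response tie-break.
print(majority_vote(["A", "B", "A"]))
print(majority_vote(["B", "A", "C"]))
```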
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting the strengths of our fully crossed design, statistical testing, and reproducibility measures. We address each major comment below, indicating where revisions will be made.
Point-by-point responses
-
Referee: Methods (RAG corpus and pipeline): The central claim that 'domain knowledge encoded in weights dominates domain knowledge supplied in context' depends on the RAG condition producing no benefit. The retrieval corpus is restricted to explanations from the MedMCQA dataset. Because MedMCQA is a separate exam-style collection, its explanations may have limited topical overlap, depth, or lexical match with MedQA-USMLE items. Without an ablation using a broader corpus (e.g., PubMed or textbooks) or a stronger retriever, the null/negative RAG result may reflect a weak context-injection baseline rather than a general property of weights versus context. This is load-bearing for the dominance interpretation and the practitioner recommendation.
Authors: We agree that the RAG corpus (MedMCQA explanations) is a specific choice and that a broader corpus such as PubMed abstracts or medical textbooks might produce stronger context injection. Our experiment deliberately holds the retrieval pipeline, corpus, and prompt fixed to isolate the effect of domain fine-tuning versus this form of in-context augmentation. The reported result is therefore conditional on the tested RAG configuration: at the 4B scale, fine-tuning yields a statistically significant gain while the chosen RAG does not. We do not claim that no possible RAG setup could ever close the gap. In the revised manuscript we will add an explicit limitations paragraph in the Discussion clarifying the scope of the claim and recommending that future comparisons test stronger retrievers and corpora. This preserves the value of the controlled comparison while acknowledging the referee's valid point about generalizability. revision: partial
-
Referee: Results (§4) and Discussion: The paper reports point estimates and McNemar p-values for the four conditions but does not provide per-question error analysis or breakdown by medical topic. Such an analysis would clarify whether the fine-tuning advantage is concentrated in areas where MedMCQA explanations are least relevant, directly testing the corpus-overlap concern raised above.
Authors: We accept this recommendation. The released JSONL traces contain per-question predictions across all four conditions, making such an analysis feasible. In the revision we will add a new subsection in Results that (i) reports accuracy stratified by available MedQA-USMLE subject categories and (ii) examines whether the fine-tuning advantage is larger on questions whose retrieved MedMCQA explanations exhibit low lexical or embedding overlap with the question stem. This directly tests the corpus-overlap hypothesis and will be accompanied by a brief discussion of any patterns observed. Because the traces are already public, the additional analysis can be performed without new model calls. revision: yes
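The overlap analysis promised in this response can be operationalized cheaply from the public traces. A sketch of the lexical variant using token-level Jaccard similarity between the question stem and its retrieved explanation (the embedding variant would substitute cosine similarity over encoder outputs); the 0.1 threshold is an illustrative assumption, not a value from the paper:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

def overlap_bucket(stem: str, explanation: str, threshold: float = 0.1) -> str:
    """Label a question as low- or high-overlap with its retrieved
    explanation; the 0.1 threshold is an illustrative choice."""
    return "high" if jaccard(stem, explanation) >= threshold else "low"

# Stratifying the fine-tuning gain by these buckets would directly
# test whether the advantage concentrates where retrieval is weakest.
print(overlap_bucket("aspirin acute mi",
                     "aspirin reduces mortality in acute mi"))
print(overlap_bucket("renal tubular acidosis",
                     "beta blocker dosing"))
```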
Circularity Check
No circularity: purely empirical comparison with direct measurements
full rationale
The paper conducts a controlled 2x2 experiment measuring accuracy on MedQA-USMLE under fixed conditions, varying only domain adaptation and RAG presence. All reported gains, p-values, and conclusions are computed directly from the 15,276 LLM calls and majority votes; no equations, derivations, parameter fits, or predictions are defined in terms of the outputs. No self-citations are load-bearing, and the design contains no ansatz, uniqueness theorem, or renaming of prior results. The skeptic concern addresses experimental coverage rather than logical self-reference.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: MedQA-USMLE is a valid and representative benchmark for medical multiple-choice question answering.
Reference graph
Works this paper leans on
- [1] Chroma. Chroma: the AI-native open-source embedding database. https://www.trychroma.com/, 2024.
- [2] Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms, 1998.
- [3] Gemma Team and Google DeepMind. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- [4] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
- [5] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- [6] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.
- [7] Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic Embed: Training a reproducible long context text embedder, 2024.
- [8] Ollama Contributors. Ollama: Get up and running with large language models locally. https://ollama.com, 2024.
- [9] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, pages 248–260, 2022.
- [10] Andrew Sellergren et al. MedGemma technical report. arXiv preprint arXiv:2507.05201, 2025.
- [11] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- [12] Yubo Wang, Xueguang Ma, and Wenhu Chen. Augmenting black-box LLMs with medical textbooks for biomedical question answering. arXiv preprint arXiv:2309.02233, 2024.
- [13] Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178, 2024.