DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline
Pith reviewed 2026-05-21 17:39 UTC · model grok-4.3
The pith
DrugRAG boosts accuracy of large language models on pharmacy questions by 7 to 21 percentage points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DrugRAG is a three-step retrieval-augmented generation pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence. When applied to five LLMs on the 141-question dataset, it increased accuracy by 7 to 21 percentage points, with statistically significant gains mainly in smaller and mid-sized open-source models.
What carries the argument
DrugRAG, a three-step retrieval-augmented generation pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally without modifying model architecture or parameters.
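The idea of augmenting prompts from outside the model can be sketched in a few lines. The abstract does not name the three steps, so the split below (find drug mentions, look up structured records, prepend them to the prompt) is an assumption, and every name in it (`DrugRecord`, `KNOWLEDGE_BASE`, `build_augmented_prompt`) is hypothetical. The two records are placeholders, not the paper's data source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugRecord:
    name: str
    drug_class: str
    key_facts: str


# Hypothetical in-memory stand-in for an external structured drug source.
KNOWLEDGE_BASE = {
    "warfarin": DrugRecord("warfarin", "vitamin K antagonist",
                           "Anticoagulant effect is monitored with INR."),
    "metformin": DrugRecord("metformin", "biguanide",
                            "Oral agent used in type 2 diabetes."),
}


def extract_drug_mentions(question: str) -> list[str]:
    """Assumed step 1: find known drug names mentioned in the question."""
    text = question.lower()
    return [name for name in KNOWLEDGE_BASE if name in text]


def retrieve_records(names: list[str]) -> list[DrugRecord]:
    """Assumed step 2: look up the structured record for each mention."""
    return [KNOWLEDGE_BASE[name] for name in names]


def augment_prompt(question: str, records: list[DrugRecord]) -> str:
    """Assumed step 3: prepend the evidence; the model itself is untouched."""
    if not records:
        return question  # nothing retrieved, so fall back to the plain question
    evidence = "\n".join(
        f"- {r.name} ({r.drug_class}): {r.key_facts}" for r in records
    )
    return f"Context:\n{evidence}\n\nQuestion: {question}"


def build_augmented_prompt(question: str) -> str:
    """Chain the three assumed steps into one external pre-processing call."""
    return augment_prompt(question, retrieve_records(extract_drug_mentions(question)))


if __name__ == "__main__":
    print(build_augmented_prompt("Which lab value is monitored for warfarin?"))
```

Because the augmentation happens entirely before the model call, the same function could sit in front of any of the evaluated models without touching weights or architecture.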
Load-bearing premise
The external structured drug information retrieved by DrugRAG is accurate, up-to-date, and sufficiently relevant to the specific questions in the 141-question pharmacy dataset.
What would settle it
Run the same evaluation on a fresh set of pharmacy questions where the retrieved information is outdated or mismatched. No accuracy gain, or a loss, in that setting would falsify the benefit of the pipeline.
Original abstract
In our study, we evaluated large language model (LLM) performance on pharmacy licensure-style question-answering tasks and developed an external knowledge integration method to improve accuracy. We benchmarked ten LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset, measuring baseline accuracy without modification. Baseline performance ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores, while smaller open-source models showed substantially lower performance. We then developed DrugRAG, a three-step retrieval-augmented generation (RAG) pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally and requiring no changes to model architecture or parameters. DrugRAG increased accuracy across all five evaluated models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%). McNemar analyses demonstrated statistically significant paired improvements primarily in smaller and mid-sized open-source models. These findings demonstrate that integrating structured external drug knowledge via DrugRAG can improve LLM performance on pharmacy-focused question-answering tasks without modifying the underlying models, providing a practical pipeline for enhancing evidence-based pharmacy-focused AI applications.
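The two example gains quoted in the abstract can be checked with simple arithmetic. The sketch below uses only the percentages given above and separates the percentage-point gain from the relative gain; the helper name `gains` is illustrative and does not come from the paper.

```python
# Baseline and DrugRAG accuracies (percent) quoted in the abstract above.
REPORTED = {
    "Gemma 3 27B": (61.0, 71.0),
    "Llama 3.1 8B": (46.0, 67.0),
}


def gains(baseline_pct: float, augmented_pct: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent of baseline)."""
    pp = augmented_pct - baseline_pct
    relative = 100.0 * pp / baseline_pct
    return pp, relative


if __name__ == "__main__":
    for model, (before, after) in REPORTED.items():
        pp, rel = gains(before, after)
        print(f"{model}: +{pp:.0f} points ({rel:.1f}% relative to baseline)")
```

Both examples fall inside the reported 7 to 21 point range, with Llama 3.1 8B at its upper end. The relative gain is larger for the weaker baseline, which is why the two measures should not be mixed when comparing models.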
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates ten LLMs (8B to 70B+ parameters) on a 141-question pharmacy licensure-style dataset, reporting baseline accuracies ranging from 46% to 92%. It then introduces DrugRAG, a three-step external RAG pipeline that retrieves structured drug information to augment prompts, and demonstrates accuracy gains of 7–21 percentage points on five models (e.g., Llama 3.1 8B: 46% to 67%; Gemma 3 27B: 61% to 71%), with McNemar tests indicating statistically significant paired improvements, especially for smaller open-source models.
Significance. If the central empirical claims hold, the work demonstrates a practical, architecture-agnostic method for integrating external pharmacological knowledge into LLMs to improve performance on domain-specific QA tasks. The reported gains and use of paired statistical testing provide concrete evidence of RAG utility in pharmacy applications without requiring model retraining or fine-tuning.
major comments (2)
- [Abstract and §3] Abstract and §3 (Methods): The three-step retrieval process is described only at a high level; no source database, retrieval algorithm, indexing method, or protocol for validating the factual accuracy, currency, or question-specific relevance of the retrieved drug information is provided. This information is load-bearing for the claim that observed gains result from evidence-based augmentation rather than prompt-length effects or lexical artifacts.
- [§4 and Table 2] §4 (Results) and Table 2: The McNemar tests are reported as significant for smaller models, but without an accompanying error analysis or breakdown of question types where retrieval succeeded or failed, it is not possible to confirm that improvements track with the quality of the augmented pharmacological content.
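The paired comparison discussed in the comments above can be illustrated with a minimal sketch of an exact (binomial) McNemar test. The manuscript summary does not say which variant of the test was used, so the exact form is an assumption here. The function name `mcnemar_exact` and the toy correctness vectors are hypothetical and are not the study's data.

```python
from math import comb


def mcnemar_exact(baseline_correct, augmented_correct):
    """Two-sided exact McNemar test on paired per-question correctness.

    Returns (b, c, p_value), where b counts questions answered correctly
    only at baseline and c counts questions answered correctly only with
    augmentation. Concordant pairs do not enter the test.
    """
    if len(baseline_correct) != len(augmented_correct):
        raise ValueError("paired inputs must have the same length")
    pairs = list(zip(baseline_correct, augmented_correct))
    b = sum(1 for base, aug in pairs if base and not aug)
    c = sum(1 for base, aug in pairs if aug and not base)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Toy data: 15 questions, 1 lost and 9 gained after augmentation.
    baseline = [True] * 1 + [False] * 9 + [True] * 5
    augmented = [False] * 1 + [True] * 9 + [True] * 5
    print(mcnemar_exact(baseline, augmented))
```

Only the discordant counts drive the p-value, which is why a breakdown of which questions flipped, and whether retrieval was relevant for them, adds information the test alone cannot provide.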
minor comments (2)
- [§2] Clarify in §2 whether the 141-question dataset was sourced from public licensure exams or constructed internally, and report any overlap with the external knowledge base.
- [Figure 1] Figure 1 (pipeline diagram) would benefit from explicit labeling of the three retrieval steps and the exact format in which retrieved content is inserted into the prompt.
Simulated Author's Rebuttal
We are grateful to the referee for their detailed and constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and strengthen the evidence for our claims.
- Referee: [Abstract and §3] Abstract and §3 (Methods): The three-step retrieval process is described only at a high level; no source database, retrieval algorithm, indexing method, or protocol for validating the factual accuracy, currency, or question-specific relevance of the retrieved drug information is provided. This information is load-bearing for the claim that observed gains result from evidence-based augmentation rather than prompt-length effects or lexical artifacts.
Authors: We thank the referee for this observation. While §3 outlines the three-step DrugRAG process at a conceptual level, we agree that greater specificity is warranted to support the causal claim. In the revised manuscript we will expand the Methods section to name the source database (a curated pharmacological repository with structured drug records), describe the retrieval algorithm (embedding-based semantic similarity search), specify the indexing method (vector store with metadata filtering), and detail the validation protocol (automated relevance scoring plus manual review of a random sample for factual accuracy and currency). We will also add an ablation experiment that augments prompts with irrelevant text of matched length to isolate the contribution of pharmacological content.
revision: yes
- Referee: [§4 and Table 2] §4 (Results) and Table 2: The McNemar tests are reported as significant for smaller models, but without an accompanying error analysis or breakdown of question types where retrieval succeeded or failed, it is not possible to confirm that improvements track with the quality of the augmented pharmacological content.
Authors: We acknowledge the value of this suggestion. The current results section reports aggregate accuracy and McNemar p-values but does not include a fine-grained error analysis. In the revised version we will add to §4 a new subsection that categorizes the 141 questions (e.g., by topic: mechanism, dosing, interactions, adverse effects) and reports per-category accuracy deltas together with qualitative examples of successful versus unsuccessful retrieval. This analysis will be performed on the full set of paired responses and will directly link performance gains to the relevance and correctness of the retrieved drug information.
revision: yes
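The per-category breakdown promised in the rebuttal could be computed along the following lines. This is a sketch under stated assumptions: the input layout (one row per question with a category label and two correctness flags), the function name `per_category_deltas`, and the toy rows are all hypothetical. The category names are borrowed from the examples in the authors' response.

```python
from collections import defaultdict


def per_category_deltas(rows):
    """Summarize paired results by question category.

    rows: iterable of (category, baseline_correct, augmented_correct).
    Returns {category: {"n", "baseline", "augmented", "delta_pp"}} with
    accuracies in percent and the delta in percentage points.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # n, baseline hits, augmented hits
    for category, base_ok, aug_ok in rows:
        tally = counts[category]
        tally[0] += 1
        tally[1] += int(base_ok)
        tally[2] += int(aug_ok)
    return {
        category: {
            "n": n,
            "baseline": 100.0 * base / n,
            "augmented": 100.0 * aug / n,
            "delta_pp": 100.0 * (aug - base) / n,
        }
        for category, (n, base, aug) in counts.items()
    }


if __name__ == "__main__":
    # Toy rows only; these are not results from the paper.
    toy = [
        ("dosing", False, True),
        ("dosing", False, True),
        ("dosing", True, True),
        ("dosing", False, False),
        ("interactions", True, True),
        ("interactions", False, False),
    ]
    for category, stats in per_category_deltas(toy).items():
        print(category, stats)
```

Keeping the rows paired, rather than storing only per-category totals, means the same table can later be joined with a retrieval-relevance label per question, which is the link the referee asks for.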
Circularity Check
No significant circularity; purely empirical evaluation on external benchmark
The paper reports baseline LLM accuracies and subsequent gains from a three-step RAG pipeline on a fixed external 141-question pharmacy dataset. No mathematical derivations, equations, fitted parameters, or predictions appear in the abstract or described structure. Performance deltas are measured directly against the held-out questions rather than generated from internal fits or self-referential definitions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The results remain falsifiable by re-running the same models and retrieval steps on the same dataset, satisfying the criteria for a self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 141-question dataset is representative of pharmacy licensure-style tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
We developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Introduction. Large language models have created new opportunities for AI-supported learning in pharmacy and healthcare 1,2. General-purpose models such as GPT-4 show high performance on medical education exams, yet pharmacy presents distinct challenges 3,4. Pharmacists must master accurate drug selection, dose calculations, and context-specific decision-m...
-
[2]
Methods. 2.1. Study Design and Question Set. We evaluated eleven language models using 141 multiple-choice questions from PharmacyExam, a NAPLEX preparation resource 11. Questions span the five NAPLEX content domains: Foundational Knowledge for Pharmacy Practice (25%), Medication Use Process (25%), Person-Centered Assessment and Treatment Planning (40%), Pr...
-
[3]
Results and Discussion. 3.1. Baseline Model Performance Across Parameter Scales. Table 1 shows substantial variation in accuracy across models of different parameter sizes. These baseline results reflect each model's inherent capabilities without any external knowledge augmentation: Bio-Medical Llama 3 8B and Llama 3.1 8B both achieved 46%, the lowest accur...
-
[4]
Limitations. Our study has several limitations. First, while the 141-question benchmark covers broad pharmacy content aligned with NAPLEX domains, we did not conduct formal analysis of question difficulty distribution within each domain. This limits claims of comprehensive topical representation. Second, we evaluated only multiple-choice questions, which d...
-
[5]
Conclusion. We benchmarked eleven large language models of varying parameter sizes on pharmacy question-answering tasks, revealing wide performance variation tied to model scale and training. We developed a three-step RAG pipeline, DrugRAG, that integrates structured drug knowledge externally, achieving 7 to 21 percentage point accuracy improvements across all tested ...
-
[6]
Declaration of Competing Interest. The authors declare no conflicts of interest related to this study.
-
[7]
Data Availability. Aggregated model predictions and analysis code are available from the corresponding author on reasonable request.
-
[8]
Funding. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. This manuscript is a preprint and has been submitted for peer review to Computer Methods and Programs in Biomedicine. The content may be updated following editorial review.
-
[9]
Ethics statement. This study did not involve human participants, animals, or access to identifiable patient data and therefore did not require institutional ethics committee approval.
-
[10]
Declaration of Generative AI and AI-assisted technologies. During manuscript preparation, the authors used ChatGPT (GPT-4o) to improve readability and language. The authors reviewed and edited all content and take full responsibility for the publication. Supplementary Materials. Supplementary File S1 contains the o3-generated reasoning traces for all 141 qu...
-
[11]
Roosan D, Padua P, Khan R, Khan H, Verzosa C, Wu Y. Effectiveness of ChatGPT in clinical pharmacy and the role of artificial intelligence in medication therapy management. Journal of the American Pharmacists Association. 2024;64(2):422-428.e8. doi: https://doi.org/10.1016/j.japh.2023.11.023
-
[12]
Large language models for preventing medication direction errors in online pharmacies
Pais C, Liu J, Voigt R, Gupta V, Wade E, Bayati M. Large language models for preventing medication direction errors in online pharmacies. Nature medicine. 2024;30(6):1574-1582. doi: https://doi.org/10.1038/s41591-024-02933-8
-
[13]
Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments
Brin D, Sorin V, Vaid A, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Scientific Reports. 2023;13(1):16492. doi: https://doi.org/10.1038/s41598-023-43436-9
-
[14]
Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC medical education. 2024;24(1):1013. doi: https://doi.org/10.1186/s12909-024-05944-8
-
[15]
Perceptions of pharmacists' roles in the era of expanding scopes of practice
Schindel TJ, Yuksel N, Breault R, Daniels J, Varnhagen S, Hughes CA. Perceptions of pharmacists' roles in the era of expanding scopes of practice. Research in Social and Administrative Pharmacy. 2017;13(1):148-161. doi: https://doi.org/10.1016/j.sapharm.2016.02.007
-
[16]
The NAPLEX: evolution, purpose, scope, and educational implications
Newton DW, Boyle M, Catizone CA. The NAPLEX: evolution, purpose, scope, and educational implications. American journal of pharmaceutical education. 2008;72(2):33. doi: https://doi.org/10.5688/aj720233
-
[17]
NAPLEX® Competency Statements and Test Specifications
NABP. NAPLEX® Competency Statements and Test Specifications. National Association of Boards of Pharmacy. Accessed May 1, 2025, https://nabp.pharmacy/wp-content/uploads/NAPLEX-Content-Outline.pdf
-
[18]
Angel M, Patel A, Alachkar A, Baldi P. Clinical knowledge and reasoning abilities of large language models in pharmacy: A comparative study on the naplex exam. IEEE; 2023:1-4
-
[19]
Ehlert A, Ehlert B, Cao B, Morbitzer K. Large Language Models and the North American Pharmacist Licensure Examination (NAPLEX) Practice Questions. American Journal of Pharmaceutical Education. 2024;88(11):101294. doi: https://doi.org/10.1016/j.ajpe.2024.101294
-
[20]
Performance of Large Language Models on Pharmacy Exam: A Comparative Assessment Using the NAPLEX
Angel M, Xing H, Patel A, Alachkar A, Baldi P. Performance of Large Language Models on Pharmacy Exam: A Comparative Assessment Using the NAPLEX. bioRxiv. 2023:2023.12.06.570434. doi: https://doi.org/10.1101/2023.12.06.570434
-
[21]
PharmacyExam.com. Accessed April 1, 2025, https://www.pharmacyexam.com
-
[22]
Bio-Medical Llama 3 8B. Hugging Face. Accessed April 1, 2025, https://huggingface.co/ContactDoctor/Bio-Medical-Llama-3-8B
-
[23]
Llama 3.1 8B Instruct. Hugging Face. Accessed April 1, 2025, https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
-
[24]
Gemma 3 27B IT. Hugging Face. Accessed June 1, 2025, https://huggingface.co/google/gemma-3-27b-it
-
[25]
OpenAI GPT-4o Model. Accessed May 1, 2025, https://platform.openai.com/docs/models/gpt-4o
-
[26]
OpenAI. GPT-5 Model. Accessed October 1, 2025, https://platform.openai.com/docs/models/gpt-5
-
[27]
OpenAI o3 Model. Accessed May 1, 2025, https://platform.openai.com/docs/models/o3
-
[28]
OpenAI o4 Mini. Accessed May 1, 2025, https://platform.openai.com/docs/models/o4-mini
-
[29]
Gemini 2.0 Flash Documentation. Google AI Developer. Accessed April 1, 2025, https://ai.google.dev/gemini-api/docs/models#gemini-2.0-flash
-
[30]
Gemini 3 Pro Documentation. Google AI Developer. Accessed December 1, 2025, https://ai.google.dev/gemini-api/docs/models#gemini-3-pro
-
[31]
Claude Opus 4.5. Accessed December 1, 2025, https://www.anthropic.com/claude/opus#claude-opus-4.5
-
[32]
Medical Chat. Medical Clinical QA System. Accessed May 1, 2025, https://medical.chat-data.com
-
[33]
OpenFDA API. Accessed June 1, 2025, https://open.fda.gov/apis/
-
[34]
DrugCentral OpenAPI. Accessed June 1, 2025, https://drugcentral.org/OpenAPI
-
[35]
DrugBank Online. API Documentation. Accessed June 1, 2025, https://docs.drugbank.com/
-
[36]
RxNorm API. Accessed June 1, 2025, https://lhncbc.nlm.nih.gov/RxNav/APIs/RxNormAPIs.html