arxiv: 2512.14896 · v2 · pith:T4CU7WA7new · submitted 2025-12-16 · 💻 cs.CL · cs.AI

DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

Pith reviewed 2026-05-21 17:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords DrugRAG · retrieval-augmented generation · pharmacy · large language models · question answering · accuracy improvement · external knowledge

The pith

DrugRAG boosts accuracy of large language models on pharmacy questions by 7 to 21 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates ten large language models on a set of 141 pharmacy licensure-style questions and finds that baseline accuracy varies widely, with top models near 90 percent and smaller ones around 46 percent. It then introduces DrugRAG, an external retrieval-augmented generation pipeline that pulls in structured drug information to enhance the prompts given to the models. This approach improves performance across all five tested models without any changes to the models themselves. A sympathetic reader would care because it offers a practical way to make AI tools more reliable for pharmacy tasks using existing models.

Core claim

DrugRAG is a three-step retrieval-augmented generation pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence. When applied to five LLMs on the 141-question dataset, it increased accuracy by 7 to 21 percentage points, with statistically significant gains mainly in smaller and mid-sized open-source models.

What carries the argument

DrugRAG, a three-step retrieval-augmented generation pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally without modifying model architecture or parameters.
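
The abstract does not say what the three steps are, so the following is only a minimal sketch of one plausible external pipeline: detect drug names, look up structured records, and prepend them to the prompt. The toy knowledge base `TOY_KB`, the `DrugRecord` fields, the step boundaries, and the prompt layout are all assumptions made for illustration, not the authors' implementation. `generate` stands in for any model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DrugRecord:
    """Hypothetical structured record; the field names are illustrative."""
    name: str
    drug_class: str
    note: str


# Hypothetical in-memory stand-in for a structured drug knowledge source.
TOY_KB = {
    "warfarin": DrugRecord("warfarin", "vitamin K antagonist", "Monitored with INR."),
    "metformin": DrugRecord("metformin", "biguanide", "Oral agent for type 2 diabetes."),
}


def extract_drug_terms(question: str, kb: dict) -> list:
    """Step 1 (assumed): find known single-word drug names in the question."""
    words = {w.strip(".,;:?!()").lower() for w in question.split()}
    return sorted(name for name in kb if name in words)


def retrieve_records(terms: list, kb: dict) -> list:
    """Step 2 (assumed): look up one structured record per matched term."""
    return [kb[t] for t in terms]


def build_prompt(question: str, records: list) -> str:
    """Step 3 (assumed): prepend retrieved evidence to the unmodified question."""
    if not records:
        return question  # nothing retrieved: fall back to the baseline prompt
    evidence = "\n".join(f"- {r.name} ({r.drug_class}): {r.note}" for r in records)
    return f"Context:\n{evidence}\n\nQuestion:\n{question}"


def answer_with_rag(question: str, kb: dict, generate: Callable[[str], str]) -> str:
    """Run the three assumed steps, then hand the prompt to an unmodified model."""
    terms = extract_drug_terms(question, kb)
    prompt = build_prompt(question, retrieve_records(terms, kb))
    return generate(prompt)
```

Because the model is only ever passed a string, any of the evaluated models could be plugged in as `generate` without touching its weights, which is the property the paper emphasizes.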

Load-bearing premise

The external structured drug information retrieved by DrugRAG is accurate, up-to-date, and sufficiently relevant to the specific questions in the 141-question pharmacy dataset.

What would settle it

Rerunning the same evaluation on a fresh set of pharmacy questions for which the retrieved information is outdated or mismatched would settle it: if accuracy showed no gain, or even a loss, the claimed benefit of the pipeline would be falsified for that setting.

Figures

Figures reproduced from arXiv: 2512.14896 by Ali Sabzi, Armin Khosravi, Babak Khalaj, Farbod Davoodi, Fatemeh Latifi, Glolamali Aminian, Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Mohammad Hossein Rohban, MohammadReza KarimiNejad, Parham Abed Azad, Seyed Reza Tavakoli, Siavash Ahmadi, Tahereh Javaheri, Zohreh Amoozgar.

Figure 1: DrugRAG architecture for medication question answering.
Figure 2: Effect of DrugRAG on LLM accuracy for pharmacy question
Abstract

In our study, we evaluated large language model (LLM) performance on pharmacy licensure-style question-answering tasks and developed an external knowledge integration method to improve accuracy. We benchmarked ten LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset, measuring baseline accuracy without modification. Baseline performance ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores, while smaller open-source models showed substantially lower performance. We then developed DrugRAG, a three-step retrieval-augmented generation (RAG) pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally and requiring no changes to model architecture or parameters. DrugRAG increased accuracy across all five evaluated models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%). McNemar analyses demonstrated statistically significant paired improvements primarily in smaller and mid-sized open-source models. These findings demonstrate that integrating structured external drug knowledge via DrugRAG can improve LLM performance on pharmacy-focused question-answering tasks without modifying the underlying models, providing a practical pipeline for enhancing evidence-based pharmacy-focused AI applications.
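
The paired McNemar analysis named in the abstract compares the same questions under two conditions and looks only at the questions whose outcome changed. The sketch below uses the exact binomial form of the test. Whether the authors used the exact or the chi-square variant is not stated, so that choice, along with the function names, is an assumption.

```python
from math import comb


def discordant_counts(baseline, augmented):
    """Count questions whose correctness changed between the two conditions.

    b: correct at baseline, wrong with augmentation
    c: wrong at baseline, correct with augmentation
    """
    b = sum(1 for x, y in zip(baseline, augmented) if x and not y)
    c = sum(1 for x, y in zip(baseline, augmented) if not x and y)
    return b, c


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis each discordant question is equally likely
    to fall in b or c, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no question changed outcome: no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-question correctness flags, not data from the paper.
baseline_flags = [True, True, False, False]
augmented_flags = [True, False, True, True]
b, c = discordant_counts(baseline_flags, augmented_flags)
p_value = mcnemar_exact_p(b, c)
```

Only discordant pairs enter the statistic, so a model that is already near the ceiling has few questions left to flip. That is consistent with significance showing up mainly for the smaller models.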

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates ten LLMs (8B to 70B+ parameters) on a 141-question pharmacy licensure-style dataset, reporting baseline accuracies ranging from 46% to 92%. It then introduces DrugRAG, a three-step external RAG pipeline that retrieves structured drug information to augment prompts, and demonstrates accuracy gains of 7–21 percentage points on five models (e.g., Llama 3.1 8B: 46% to 67%; Gemma 3 27B: 61% to 71%), with McNemar tests indicating statistically significant paired improvements, especially for smaller open-source models.

Significance. If the central empirical claims hold, the work demonstrates a practical, architecture-agnostic method for integrating external pharmacological knowledge into LLMs to improve performance on domain-specific QA tasks. The reported gains and use of paired statistical testing provide concrete evidence of RAG utility in pharmacy applications without requiring model retraining or fine-tuning.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Methods): The three-step retrieval process is described only at a high level; no source database, retrieval algorithm, indexing method, or protocol for validating the factual accuracy, currency, or question-specific relevance of the retrieved drug information is provided. This information is load-bearing for the claim that observed gains result from evidence-based augmentation rather than prompt-length effects or lexical artifacts.
  2. [§4 and Table 2] §4 (Results) and Table 2: The McNemar tests are reported as significant for smaller models, but without an accompanying error analysis or breakdown of question types where retrieval succeeded or failed, it is not possible to confirm that improvements track with the quality of the augmented pharmacological content.
minor comments (2)
  1. [§2] Clarify in §2 whether the 141-question dataset was sourced from public licensure exams or constructed internally, and report any overlap with the external knowledge base.
  2. [Figure 1] Figure 1 (pipeline diagram) would benefit from explicit labeling of the three retrieval steps and the exact format in which retrieved content is inserted into the prompt.
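
Major comment 1 raises the possibility that gains come from prompt length rather than pharmacological content. One way to probe this is a control condition whose context has the same length as the retrieved evidence but carries no drug information. The sketch below builds such a condition. Whitespace-separated words are a crude proxy for model tokens, and the function names and prompt layout are assumptions for illustration.

```python
from itertools import cycle, islice


def token_count(text: str) -> int:
    """Approximate length as the number of whitespace-separated words."""
    return len(text.split())


def matched_length_filler(evidence: str, filler_source: str) -> str:
    """Return irrelevant text with the same word count as `evidence`.

    Words from `filler_source` are repeated cyclically if it is too short.
    """
    words = filler_source.split()
    if not words:
        raise ValueError("filler_source must contain at least one word")
    return " ".join(islice(cycle(words), token_count(evidence)))


def make_conditions(question: str, evidence: str, filler_source: str) -> dict:
    """Build the three prompts to compare on the same question."""
    control = matched_length_filler(evidence, filler_source)
    return {
        "baseline": question,
        "rag": f"Context:\n{evidence}\n\nQuestion:\n{question}",
        "length_control": f"Context:\n{control}\n\nQuestion:\n{question}",
    }
```

If the `length_control` prompts recover a similar share of the gain as the `rag` prompts, the improvement cannot be attributed to the retrieved content.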

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and strengthen the evidence for our claims.

  1. Referee: [Abstract and §3] Abstract and §3 (Methods): The three-step retrieval process is described only at a high level; no source database, retrieval algorithm, indexing method, or protocol for validating the factual accuracy, currency, or question-specific relevance of the retrieved drug information is provided. This information is load-bearing for the claim that observed gains result from evidence-based augmentation rather than prompt-length effects or lexical artifacts.

    Authors: We thank the referee for this observation. While §3 outlines the three-step DrugRAG process at a conceptual level, we agree that greater specificity is warranted to support the causal claim. In the revised manuscript we will expand the Methods section to name the source database (a curated pharmacological repository with structured drug records), describe the retrieval algorithm (embedding-based semantic similarity search), specify the indexing method (vector store with metadata filtering), and detail the validation protocol (automated relevance scoring plus manual review of a random sample for factual accuracy and currency). We will also add an ablation experiment that augments prompts with irrelevant text of matched length to isolate the contribution of pharmacological content. revision: yes

  2. Referee: [§4 and Table 2] §4 (Results) and Table 2: The McNemar tests are reported as significant for smaller models, but without an accompanying error analysis or breakdown of question types where retrieval succeeded or failed, it is not possible to confirm that improvements track with the quality of the augmented pharmacological content.

    Authors: We acknowledge the value of this suggestion. The current results section reports aggregate accuracy and McNemar p-values but does not include a fine-grained error analysis. In the revised version we will add to §4 a new subsection that categorizes the 141 questions (e.g., by topic: mechanism, dosing, interactions, adverse effects) and reports per-category accuracy deltas together with qualitative examples of successful versus unsuccessful retrieval. This analysis will be performed on the full set of paired responses and will directly link performance gains to the relevance and correctness of the retrieved drug information. revision: yes
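
The per-category breakdown promised in the second response can be computed directly from paired per-question outcomes. The sketch below assumes each question carries a topic label and two correctness flags. The row layout and the output field names are hypothetical, and the category names simply reuse the examples given in the response.

```python
from collections import defaultdict


def per_category_deltas(rows):
    """Summarize paired outcomes by category.

    rows: iterable of (category, baseline_correct, rag_correct) tuples.
    Returns, per category, the question count, both accuracies, and the
    change in accuracy expressed in percentage points.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [n, baseline correct, rag correct]
    for category, base_ok, rag_ok in rows:
        entry = totals[category]
        entry[0] += 1
        entry[1] += int(base_ok)
        entry[2] += int(rag_ok)
    return {
        cat: {
            "n": n,
            "baseline_acc": b / n,
            "rag_acc": r / n,
            "delta_pp": 100.0 * (r - b) / n,
        }
        for cat, (n, b, r) in totals.items()
    }


# Hypothetical rows for illustration, not results from the paper.
example_rows = [
    ("dosing", False, True),
    ("dosing", True, True),
    ("interactions", True, False),
    ("interactions", True, True),
]
summary = per_category_deltas(example_rows)
```

With 141 questions split over several topics, some categories will contain few items, so per-category deltas are better read alongside their `n` than on their own.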

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation on external benchmark

The paper reports baseline LLM accuracies and subsequent gains from a three-step RAG pipeline on a fixed external 141-question pharmacy dataset. No mathematical derivations, equations, fitted parameters, or predictions appear in the abstract or described structure. Performance deltas are measured directly against the held-out questions rather than generated from internal fits or self-referential definitions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The results remain falsifiable by re-running the same models and retrieval steps on the same dataset, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions from LLM evaluation and RAG literature without introducing new free parameters or postulated entities.

axioms (1)
  • domain assumption: The 141-question dataset is representative of pharmacy licensure-style tasks.
    Used as the sole benchmark without reported validation or external corroboration in the abstract.
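
Representativeness aside, a 141-question benchmark also carries sampling uncertainty in each reported accuracy. The sketch below computes a Wilson score interval for a single proportion. The count of 65 correct answers is an assumption: it is the nearest whole number to 46% of 141, and the abstract does not give raw counts.

```python
from math import sqrt


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Assumed count: about 46% of 141 questions.
low, high = wilson_interval(65, 141)
```

Under that assumption the interval runs from roughly 0.38 to 0.54, a reminder that single-digit percentage-point differences on a set of this size deserve the paired testing the paper reports.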

pith-pipeline@v0.9.0 · 5871 in / 1219 out tokens · 59291 ms · 2026-05-21T17:39:46.517535+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

  1. [1] Introduction. Large language models have created new opportunities for AI-supported learning in pharmacy and healthcare 1,2. General-purpose models such as GPT-4 show high performance on medical education exams, yet pharmacy presents distinct challenges 3,4. Pharmacists must master accurate drug selection, dose calculations, and context-specific decision-m...
  2. [2] Methods. 2.1. Study Design and Question Set. We evaluated eleven language models using 141 multiple-choice questions from PharmacyExam, a NAPLEX preparation resource 11. Questions span the five NAPLEX content domains: Foundational Knowledge for Pharmacy Practice (25%), Medication Use Process (25%), Person-Centered Assessment and Treatment Planning (40%), Pr...
  3. [3] Results and Discussion. 3.1. Baseline Model Performance Across Parameter Scales. Table 1 shows substantial variation in accuracy across models of different parameter sizes. These baseline results reflect each model's inherent capabilities without any external knowledge augmentation: Bio-Medical Llama 3 8B and Llama 3.1 8B both achieved 46%, the lowest accur...
  4. [4] Limitations. Our study has several limitations. First, while the 141-question benchmark covers broad pharmacy content aligned with NAPLEX domains, we did not conduct formal analysis of question difficulty distribution within each domain. This limits claims of comprehensive topical representation. Second, we evaluated only multiple-choice questions, which d...
  5. [5] Conclusion. We benchmarked eleven large language models of varying parameter sizes on pharmacy question-answering tasks, revealing wide performance variation tied to model scale and training. We developed a three-step RAG pipeline, DrugRAG, that integrates structured drug knowledge externally, achieving 7-21 percentage-point accuracy improvements across all tested ...
  6. [6] Declaration of Competing Interest. The authors declare no conflicts of interest related to this study
  7. [7] Data Availability. Aggregated model predictions and analysis code are available from the corresponding author on reasonable request
  8. [8] Funding. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. This manuscript is a preprint and has been submitted for peer review to Computer Methods and Programs in Biomedicine. The content may be updated following editorial review
  9. [9] Ethics statement. This study did not involve human participants, animals, or access to identifiable patient data and therefore did not require institutional ethics committee approval
  10. [10] Declaration of Generative AI and AI-assisted technologies. During manuscript preparation, the authors used ChatGPT (GPT-4o) to improve readability and language. The authors reviewed and edited all content and take full responsibility for the publication. Supplementary Materials. Supplementary File S1 contains the o3-generated reasoning traces for all 141 qu...
  11. [11] Roosan D, Padua P, Khan R, Khan H, Verzosa C, Wu Y. Effectiveness of ChatGPT in clinical pharmacy and the role of artificial intelligence in medication therapy management. Journal of the American Pharmacists Association. 2024;64(2):422-428.e8. doi: https://doi.org/10.1016/j.japh.2023.11.023
  12. [12] Pais C, Liu J, Voigt R, Gupta V, Wade E, Bayati M. Large language models for preventing medication direction errors in online pharmacies. Nature medicine. 2024;30(6):1574-1582. doi: https://doi.org/10.1038/s41591-024-02933-8
  13. [13] Brin D, Sorin V, Vaid A, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Scientific Reports. 2023;13(1):16492. doi: https://doi.org/10.1038/s41598-023-43436-9
  14. [14] Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC medical education. 2024;24(1):1013. doi: https://doi.org/10.1186/s12909-024-05944-8
  15. [15] Schindel TJ, Yuksel N, Breault R, Daniels J, Varnhagen S, Hughes CA. Perceptions of pharmacists' roles in the era of expanding scopes of practice. Research in Social and Administrative Pharmacy. 2017;13(1):148-161. doi: https://doi.org/10.1016/j.sapharm.2016.02.007
  16. [16] Newton DW, Boyle M, Catizone CA. The NAPLEX: evolution, purpose, scope, and educational implications. American journal of pharmaceutical education. 2008;72(2):33. doi: https://doi.org/10.5688/aj720233
  17. [17] NABP. NAPLEX® Competency Statements and Test Specifications. National Association of Boards of Pharmacy. Accessed May 1, 2025, https://nabp.pharmacy/wp-content/uploads/NAPLEX-Content-Outline.pdf
  18. [18] Angel M, Patel A, Alachkar A, Baldi P. Clinical knowledge and reasoning abilities of large language models in pharmacy: A comparative study on the naplex exam. IEEE; 2023:1-4
  19. [19] Ehlert A, Ehlert B, Cao B, Morbitzer K. Large Language Models and the North American Pharmacist Licensure Examination (NAPLEX) Practice Questions. American Journal of Pharmaceutical Education. 2024;88(11):101294. doi: https://doi.org/10.1016/j.ajpe.2024.101294
  20. [20] Angel M, Xing H, Patel A, Alachkar A, Baldi P. Performance of Large Language Models on Pharmacy Exam: A Comparative Assessment Using the NAPLEX. bioRxiv. 2023:2023.12.06.570434. doi: https://doi.org/10.1101/2023.12.06.570434
  21. [21] PharmacyExam.com. Accessed April 1, 2025, https://www.pharmacyexam.com
  22. [22] Bio-Medical Llama 3 8B. Hugging Face. Accessed April 1, 2025, https://huggingface.co/ContactDoctor/Bio-Medical-Llama-3-8B
  23. [23] Llama 3.1 8B Instruct. Hugging Face. Accessed April 1, 2025, https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
  24. [24] Gemma 3 27B IT. Hugging Face. Accessed June 1, 2025, https://huggingface.co/google/gemma-3-27b-it
  25. [25] OpenAI GPT-4o Model. Accessed May 1, 2025, https://platform.openai.com/docs/models/gpt-4o
  26. [26] OpenAI. GPT-5 Model. Accessed October 1, 2025, https://platform.openai.com/docs/models/gpt-5
  27. [27] OpenAI o3 Model. Accessed May 1, 2025, https://platform.openai.com/docs/models/o3
  28. [28] OpenAI o4 Mini. Accessed May 1, 2025, https://platform.openai.com/docs/models/o4-mini
  29. [29] Gemini 2.0 Flash Documentation. Google AI Developer. Accessed April 1, 2025, https://ai.google.dev/gemini-api/docs/models#gemini-2.0-flash
  30. [30] Gemini 3 Pro Documentation. Google AI Developer. Accessed December 1, 2025, https://ai.google.dev/gemini-api/docs/models#gemini-3-pro
  31. [31] Claude Opus 4.5. Accessed December 1, 2025, https://www.anthropic.com/claude/opus#claude-opus-4.5
  32. [32] Medical Chat. Medical Clinical QA System. Accessed May 1, 2025, https://medical.chat-data.com
  33. [33] OpenFDA API. Accessed June 1, 2025, https://open.fda.gov/apis/
  34. [34] DrugCentral OpenAPI. Accessed June 1, 2025, https://drugcentral.org/OpenAPI
  35. [35] DrugBank Online. API Documentation. Accessed June 1, 2025, https://docs.drugbank.com/
  36. [36] RxNorm API. Accessed June 1, 2025, https://lhncbc.nlm.nih.gov/RxNav/APIs/RxNormAPIs.html