DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

Ali Sabzi; Armin Khosravi; Babak Khalaj; Farbod Davoodi; Fatemeh Latifi; Glolamali Aminian; Houman Kazemzadeh; Kiarash Mokhtari Dizaji; Mohammad Hossein Rohban; MohammadReza KarimiNejad

arxiv: 2512.14896 · v2 · pith:T4CU7WA7new · submitted 2025-12-16 · 💻 cs.CL · cs.AI

Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad, Parham Abed Azad, Fatemeh Latifi, Ali Sabzi, Armin Khosravi, Siavash Ahmadi, Babak Khalaj, Mohammad Hossein Rohban, Glolamali Aminian, Zohreh Amoozgar, Tahereh Javaheri

Pith reviewed 2026-05-21 17:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords DrugRAG · retrieval-augmented generation · pharmacy · large language models · question answering · accuracy improvement · external knowledge

The pith

DrugRAG boosts accuracy of large language models on pharmacy questions by 7 to 21 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates ten large language models on a set of 141 pharmacy licensure-style questions and finds that baseline accuracy varies widely, with top models near 90 percent and smaller ones around 46 percent. It then introduces DrugRAG, an external retrieval-augmented generation pipeline that pulls in structured drug information to enhance the prompts given to the models. This approach improves performance across all five tested models without any changes to the models themselves. A sympathetic reader would care because it offers a practical way to make AI tools more reliable for pharmacy tasks using existing models.

Core claim

DrugRAG is a three-step retrieval-augmented generation pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence. When applied to five LLMs on the 141-question dataset, it increased accuracy by 7 to 21 percentage points, with statistically significant gains mainly in smaller and mid-sized open-source models.

What carries the argument

DrugRAG, a three-step retrieval-augmented generation pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally without modifying model architecture or parameters.
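
The idea of augmenting a prompt externally, without touching the model, can be illustrated with a minimal sketch. The source does not say what the three steps are, so the split used here (drug-name extraction, record lookup, prompt augmentation) is an assumption, as are the `DrugRecord` fields, the toy `KNOWLEDGE_BASE`, and every function name. This is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class DrugRecord:
    # Hypothetical structured record; the field names are assumptions.
    name: str
    drug_class: str
    indications: str
    key_warnings: str


# Toy in-memory knowledge base standing in for an external drug source.
KNOWLEDGE_BASE = {
    "warfarin": DrugRecord(
        name="warfarin",
        drug_class="vitamin K antagonist",
        indications="anticoagulation",
        key_warnings="bleeding risk; many drug interactions",
    ),
}


def extract_drug_names(question: str) -> list[str]:
    """Step 1 (assumed): find known drug names mentioned in the question."""
    lowered = question.lower()
    return [name for name in KNOWLEDGE_BASE if name in lowered]


def retrieve_records(names: list[str]) -> list[DrugRecord]:
    """Step 2 (assumed): look up a structured record for each drug name."""
    return [KNOWLEDGE_BASE[name] for name in names]


def augment_prompt(question: str, records: list[DrugRecord]) -> str:
    """Step 3 (assumed): prepend retrieved evidence to the unchanged question."""
    if not records:
        return question  # nothing retrieved: fall back to the bare question
    evidence = "\n".join(
        f"- {r.name}: class={r.drug_class}; indications={r.indications}; "
        f"warnings={r.key_warnings}"
        for r in records
    )
    return f"Drug evidence:\n{evidence}\n\nQuestion:\n{question}"


def build_prompt(question: str) -> str:
    """Run the three assumed steps in order and return the final prompt."""
    return augment_prompt(question, retrieve_records(extract_drug_names(question)))
```

Because the model only ever sees the returned string, the same `build_prompt` output could be sent to any of the evaluated models, which is what "operating externally" amounts to in practice.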

Load-bearing premise

The external structured drug information retrieved by DrugRAG is accurate, up-to-date, and sufficiently relevant to the specific questions in the 141-question pharmacy dataset.

What would settle it

Run the same evaluation on a fresh set of pharmacy questions for which the retrieved information is outdated or mismatched. No accuracy gain, or an outright loss, under those conditions would falsify the claimed benefit of the pipeline.

Figures

Figures reproduced from arXiv: 2512.14896 by Ali Sabzi, Armin Khosravi, Babak Khalaj, Farbod Davoodi, Fatemeh Latifi, Glolamali Aminian, Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Mohammad Hossein Rohban, MohammadReza KarimiNejad, Parham Abed Azad, Seyed Reza Tavakoli, Siavash Ahmadi, Tahereh Javaheri, Zohreh Amoozgar.

**Figure 2.** Effect of DrugRAG on LLM accuracy for pharmacy question

Original abstract

In our study, we evaluated large language model (LLM) performance on pharmacy licensure-style question-answering tasks and developed an external knowledge integration method to improve accuracy. We benchmarked ten LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset, measuring baseline accuracy without modification. Baseline performance ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores, while smaller open-source models showed substantially lower performance. We then developed DrugRAG, a three-step retrieval-augmented generation (RAG) pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally and requiring no changes to model architecture or parameters. DrugRAG increased accuracy across all five evaluated models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61.0% to 71%, Llama 3.1 8B: 46% to 67%). McNemar analyses demonstrated statistically significant paired improvements primarily in smaller and mid-sized open-source models. These findings demonstrate that integrating structured external drug knowledge via DrugRAG can improve LLM performance on pharmacy-focused question-answering tasks without modifying the underlying models, providing a practical pipeline for enhancing evidence-based pharmacy-focused AI applications.
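
The paired McNemar analysis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which variant of the test was used, so the exact binomial form below is an assumption, and the function name and input layout (two lists of per-question correctness flags for the same questions) are hypothetical.

```python
from math import comb


def mcnemar_exact(baseline_correct, rag_correct):
    """Two-sided exact McNemar test on paired per-question correctness.

    Returns (b, c, p_value), where b counts questions only the baseline run
    answered correctly and c counts questions only the augmented run
    answered correctly.
    """
    if len(baseline_correct) != len(rag_correct):
        raise ValueError("paired lists must have equal length")
    pairs = list(zip(baseline_correct, rag_correct))
    b = sum(1 for base_ok, rag_ok in pairs if base_ok and not rag_ok)
    c = sum(1 for base_ok, rag_ok in pairs if rag_ok and not base_ok)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    # Under the null hypothesis the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Only the discordant pairs enter the test. Two runs with the same overall accuracy gain can therefore yield different p-values, depending on how many questions flipped in each direction.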

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates ten LLMs (8B to 70B+ parameters) on a 141-question pharmacy licensure-style dataset, reporting baseline accuracies ranging from 46% to 92%. It then introduces DrugRAG, a three-step external RAG pipeline that retrieves structured drug information to augment prompts, and demonstrates accuracy gains of 7–21 percentage points on five models (e.g., Llama 3.1 8B: 46% to 67%; Gemma 3 27B: 61% to 71%), with McNemar tests indicating statistically significant paired improvements, especially for smaller open-source models.

Significance. If the central empirical claims hold, the work demonstrates a practical, architecture-agnostic method for integrating external pharmacological knowledge into LLMs to improve performance on domain-specific QA tasks. The reported gains and use of paired statistical testing provide concrete evidence of RAG utility in pharmacy applications without requiring model retraining or fine-tuning.

major comments (2)

Abstract and §3 (Methods): The three-step retrieval process is described only at a high level; no source database, retrieval algorithm, indexing method, or protocol for validating the factual accuracy, currency, or question-specific relevance of the retrieved drug information is provided. This information is load-bearing for the claim that observed gains result from evidence-based augmentation rather than prompt-length effects or lexical artifacts.
§4 (Results) and Table 2: The McNemar tests are reported as significant for smaller models, but without an accompanying error analysis or breakdown of question types where retrieval succeeded or failed, it is not possible to confirm that improvements track with the quality of the augmented pharmacological content.
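
One way to probe the prompt-length concern raised in the first major comment is a control run in which the retrieved evidence is swapped for irrelevant text of the same length. The sketch below builds such filler. It is a hypothetical helper, not part of the paper, and it matches length by whitespace-separated words, which only approximates the token count a given model's tokenizer would produce.

```python
from itertools import cycle


def matched_length_filler(evidence: str, filler_sentences: list[str]) -> str:
    """Build irrelevant text with the same whitespace-word count as `evidence`."""
    if not any(sentence.split() for sentence in filler_sentences):
        raise ValueError("need at least one non-empty filler sentence")
    target = len(evidence.split())
    words: list[str] = []
    # Cycle through the filler pool until enough words have been collected.
    for sentence in cycle(filler_sentences):
        if len(words) >= target:
            break
        words.extend(sentence.split())
    return " ".join(words[:target])
```

If accuracy with the filler context stays close to the baseline while accuracy with the real evidence rises, the gain is more plausibly due to the pharmacological content than to the longer prompt.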

minor comments (2)

§2: Clarify whether the 141-question dataset was sourced from public licensure exams or constructed internally, and report any overlap with the external knowledge base.
Figure 1 (pipeline diagram): The figure would benefit from explicit labeling of the three retrieval steps and the exact format in which retrieved content is inserted into the prompt.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and strengthen the evidence for our claims.

Referee: Abstract and §3 (Methods): The three-step retrieval process is described only at a high level; no source database, retrieval algorithm, indexing method, or protocol for validating the factual accuracy, currency, or question-specific relevance of the retrieved drug information is provided. This information is load-bearing for the claim that observed gains result from evidence-based augmentation rather than prompt-length effects or lexical artifacts.

Authors: We thank the referee for this observation. While §3 outlines the three-step DrugRAG process at a conceptual level, we agree that greater specificity is warranted to support the causal claim. In the revised manuscript we will expand the Methods section to name the source database (a curated pharmacological repository with structured drug records), describe the retrieval algorithm (embedding-based semantic similarity search), specify the indexing method (vector store with metadata filtering), and detail the validation protocol (automated relevance scoring plus manual review of a random sample for factual accuracy and currency). We will also add an ablation experiment that augments prompts with irrelevant text of matched length to isolate the contribution of pharmacological content.

revision: yes

Referee: §4 (Results) and Table 2: The McNemar tests are reported as significant for smaller models, but without an accompanying error analysis or breakdown of question types where retrieval succeeded or failed, it is not possible to confirm that improvements track with the quality of the augmented pharmacological content.

Authors: We acknowledge the value of this suggestion. The current results section reports aggregate accuracy and McNemar p-values but does not include a fine-grained error analysis. In the revised version we will add to §4 a new subsection that categorizes the 141 questions (e.g., by topic: mechanism, dosing, interactions, adverse effects) and reports per-category accuracy deltas together with qualitative examples of successful versus unsuccessful retrieval. This analysis will be performed on the full set of paired responses and will directly link performance gains to the relevance and correctness of the retrieved drug information.

revision: yes
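
The per-category accuracy deltas described in the second response could be computed along the following lines. This is a minimal sketch under stated assumptions: the row layout `(category, baseline_correct, rag_correct)` and the function name are hypothetical, and the category labels would come from whatever annotation scheme the revised manuscript adopts.

```python
from collections import defaultdict


def per_category_deltas(rows):
    """Summarize paired results by question category.

    rows: iterable of (category, baseline_correct, rag_correct), one per question.
    Returns {category: (n, baseline_acc, rag_acc, delta)}, with accuracies in
    percent and delta in percentage points.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # n, baseline hits, augmented hits
    for category, base_ok, rag_ok in rows:
        entry = counts[category]
        entry[0] += 1
        entry[1] += bool(base_ok)
        entry[2] += bool(rag_ok)
    result = {}
    for category, (n, base_hits, rag_hits) in counts.items():
        base_acc = 100.0 * base_hits / n
        rag_acc = 100.0 * rag_hits / n
        result[category] = (n, base_acc, rag_acc, rag_acc - base_acc)
    return result
```

With only 141 questions in total, individual categories may hold few items, so reporting `n` next to each delta keeps small-sample swings visible.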

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation on external benchmark

The paper reports baseline LLM accuracies and subsequent gains from a three-step RAG pipeline on a fixed external 141-question pharmacy dataset. No mathematical derivations, equations, fitted parameters, or predictions appear in the abstract or described structure. Performance deltas are measured directly against the held-out questions rather than generated from internal fits or self-referential definitions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The results remain falsifiable by re-running the same models and retrieval steps on the same dataset, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions from LLM evaluation and RAG literature without introducing new free parameters or postulated entities.

axioms (1)

Domain assumption: The 141-question dataset is representative of pharmacy licensure-style tasks. It is used as the sole benchmark, with no validation or external corroboration reported in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: "We developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

[1] Introduction. Large language models have created new opportunities for AI-supported learning in pharmacy and healthcare 1,2. General-purpose models such as GPT-4 show high performance on medical education exams, yet pharmacy presents distinct challenges 3,4. Pharmacists must master accurate drug selection, dose calculations, and context-specific decision-m...
[2] Methods. 2.1. Study Design and Question Set. We evaluated eleven language models using 141 multiple-choice questions from PharmacyExam, a NAPLEX preparation resource 11. Questions span the five NAPLEX content domains: Foundational Knowledge for Pharmacy Practice (25%), Medication Use Process (25%), Person-Centered Assessment and Treatment Planning (40%), Pr...
[3] Results and Discussion. 3.1. Baseline Model Performance Across Parameter Scales. Table 1 shows substantial variation in accuracy across models of different parameter sizes. These baseline results reflect each model's inherent capabilities without any external knowledge augmentation: Bio-Medical Llama 3 8B and Llama 3.1 8B both achieved 46%, the lowest accur...
[4] Limitations. Our study has several limitations. First, while the 141-question benchmark covers broad pharmacy content aligned with NAPLEX domains, we did not conduct formal analysis of question difficulty distribution within each domain. This limits claims of comprehensive topical representation. Second, we evaluated only multiple-choice questions, which d...
[5] Conclusion. We benchmarked eleven large language models of varying parameter sizes on pharmacy question-answering tasks, revealing wide performance variation tied to model scale and training. We developed a three-step RAG pipeline, DrugRAG, that integrates structured drug knowledge externally, achieving 7-21 % point accuracy improvements across all tested ...
[6] Declaration of Competing Interest. The authors declare no conflicts of interest related to this study
[7] Data Availability. Aggregated model predictions and analysis code are available from the corresponding author on reasonable request
[8] Funding. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. This manuscript is a preprint and has been submitted for peer review to Computer Methods and Programs in Biomedicine. The content may be updated following editorial review
[9] Ethics statement. This study did not involve human participants, animals, or access to identifiable patient data and therefore did not require institutional ethics committee approval
[10] Declaration of Generative AI and AI-assisted technologies. During manuscript preparation, the authors used ChatGPT (GPT-4o) to improve readability and language. The authors reviewed and edited all content and take full responsibility for the publication. Supplementary Materials. Supplementary File S1 contains the o3-generated reasoning traces for all 141 qu...
[11] Roosan D, Padua P, Khan R, Khan H, Verzosa C, Wu Y. Effectiveness of ChatGPT in clinical pharmacy and the role of artificial intelligence in medication therapy management. Journal of the American Pharmacists Association. 2024;64(2):422-428.e8. doi: https://doi.org/10.1016/j.japh.2023.11.023
[12] Pais C, Liu J, Voigt R, Gupta V, Wade E, Bayati M. Large language models for preventing medication direction errors in online pharmacies. Nature Medicine. 2024;30(6):1574-1582. doi: https://doi.org/10.1038/s41591-024-02933-8
[13] Brin D, Sorin V, Vaid A, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Scientific Reports. 2023;13(1):16492. doi: https://doi.org/10.1038/s41598-023-43436-9
[14] Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Medical Education. 2024;24(1):1013. doi: https://doi.org/10.1186/s12909-024-05944-8
[15] Schindel TJ, Yuksel N, Breault R, Daniels J, Varnhagen S, Hughes CA. Perceptions of pharmacists' roles in the era of expanding scopes of practice. Research in Social and Administrative Pharmacy. 2017;13(1):148-161. doi: https://doi.org/10.1016/j.sapharm.2016.02.007
[16] Newton DW, Boyle M, Catizone CA. The NAPLEX: evolution, purpose, scope, and educational implications. American Journal of Pharmaceutical Education. 2008;72(2):33. doi: https://doi.org/10.5688/aj720233
[17] NABP. NAPLEX® Competency Statements and Test Specifications. National Association of Boards of Pharmacy. Accessed May 1, 2025, https://nabp.pharmacy/wp-content/uploads/NAPLEX-Content-Outline.pdf
[18] Angel M, Patel A, Alachkar A, Baldi P. Clinical knowledge and reasoning abilities of large language models in pharmacy: A comparative study on the naplex exam. IEEE; 2023:1-4
[19] Ehlert A, Ehlert B, Cao B, Morbitzer K. Large Language Models and the North American Pharmacist Licensure Examination (NAPLEX) Practice Questions. American Journal of Pharmaceutical Education. 2024;88(11):101294. doi: https://doi.org/10.1016/j.ajpe.2024.101294
[20] Angel M, Xing H, Patel A, Alachkar A, Baldi P. Performance of Large Language Models on Pharmacy Exam: A Comparative Assessment Using the NAPLEX. bioRxiv. 2023:2023.12.06.570434. doi: https://doi.org/10.1101/2023.12.06.570434
[21] PharmacyExam.com. Accessed April 1, 2025, https://www.pharmacyexam.com
[22] Bio-Medical Llama 3 8B. Hugging Face. Accessed April 1, 2025, https://huggingface.co/ContactDoctor/Bio-Medical-Llama-3-8B
[23] Llama 3.1 8B Instruct. Hugging Face. Accessed April 1, 2025, https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
[24] Gemma 3 27B IT. Hugging Face. Accessed June 1, 2025, https://huggingface.co/google/gemma-3-27b-it
[25] OpenAI GPT-4o Model. Accessed May 1, 2025, https://platform.openai.com/docs/models/gpt-4o
[26] OpenAI. GPT-5 Model. Accessed October 1, 2025, https://platform.openai.com/docs/models/gpt-5
[27] OpenAI o3 Model. Accessed May 1, 2025, https://platform.openai.com/docs/models/o3
[28] OpenAI o4 Mini. Accessed May 1, 2025, https://platform.openai.com/docs/models/o4-mini
[29] Gemini 2.0 Flash Documentation. Google AI Developer. Accessed April 1, 2025, https://ai.google.dev/gemini-api/docs/models#gemini-2.0-flash
[30] Gemini 3 Pro Documentation. Google AI Developer. Accessed December 1, 2025, https://ai.google.dev/gemini-api/docs/models#gemini-3-pro
[31] Claude Opus 4.5. Accessed December 1, 2025, https://www.anthropic.com/claude/opus#claude-opus-4.5
[32] Medical Chat. Medical Clinical QA System. Accessed May 1, 2025, https://medical.chat-data.com
[33] OpenFDA API. Accessed June 1, 2025, https://open.fda.gov/apis/
[34] DrugCentral OpenAPI. Accessed June 1, 2025, https://drugcentral.org/OpenAPI
[35] DrugBank Online. API Documentation. Accessed June 1, 2025, https://docs.drugbank.com/
[36] RxNorm API. Accessed June 1, 2025, https://lhncbc.nlm.nih.gov/RxNav/APIs/RxNormAPIs.html
