Recognition: no theorem link
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
Pith reviewed 2026-05-15 02:04 UTC · model grok-4.3
The pith
RxEval tests LLMs on specific medication-dose-route choices from detailed patient trajectories, revealing top models reach only 46 percent exact match.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RxEval consists of 1,547 multiple-choice questions drawn from 584 patients and spanning 18 diagnostic categories and 969 unique medications. Questions present evolving patient information and require selection of exact medication-dose-route triples, with distractors created by perturbing the reasoning chain that led to the original prescription. Evaluations across 16 models show a wide performance spread yet uniformly low exact-match accuracy, indicating that frontier systems still overlook explicit patient constraints and fail to derive the correct clinical conclusion from the supplied trajectory.
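For concreteness, a minimal sketch of how one such question could be represented and checked; the field names below (patient_profile, trajectory, answer_index, and so on) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MedTriple:
    """One answer option: a specific medication, dose, and route."""
    medication: str
    dose: str
    route: str

@dataclass
class RxEvalItem:
    """Illustrative shape of a single prescription-level question."""
    patient_profile: str      # demographics, history, home medications
    trajectory: list[str]     # time-ordered clinical events up to the decision point
    prescription_time: str    # timestamp of the prescribing decision
    options: list[MedTriple]  # the real prescription plus perturbation-derived distractors
    answer_index: int         # index of the triple that was actually prescribed

def is_exact_match(item: RxEvalItem, chosen_index: int) -> bool:
    """Credit only a fully correct medication-dose-route choice."""
    return chosen_index == item.answer_index
```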
What carries the argument
Prescription-level multiple-choice questions that pair real patient trajectories with distractors generated through reasoning-chain perturbation, forcing selection of precise medication-dose-route triples rather than broad drug classes.
If this is right
- LLMs must improve at tracking time-ordered changes in patient state to reach high accuracy on this task.
- Exact-match rates below 50 percent imply that direct deployment in prescribing workflows would require substantial human review.
- Error patterns centered on overlooked patient facts point to the need for better mechanisms that explicitly ground answers in the full trajectory.
- The wide spread in F1 scores across models demonstrates that the benchmark can rank systems more finely than coarser drug-code tasks.
Where Pith is reading between the lines
- If models close the gap on RxEval, the same patient-trajectory format could be adapted to evaluate other sequential clinical decisions such as lab ordering or discharge planning.
- The benchmark's emphasis on exact dose and route suggests that future safety evaluations should test not only correctness but also the model's ability to flag unsafe alternatives among the distractors.
- Real-world validation would require comparing model choices against actual clinician decisions rather than held-out prescriptions alone.
Load-bearing premise
The distractors created by perturbing the reasoning chain are realistic enough to stand in for the kinds of alternatives a clinician would actually consider.
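If this premise is doubted, a first-pass automated audit can flag distractors that are either nearly identical to the real prescription (and so possibly accidentally correct) or barely related to it (and so possibly implausible). A minimal sketch, using crude lexical similarity as a stand-in for whatever semantic measure the authors would actually use; the thresholds and example strings are arbitrary placeholders:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity in [0, 1]; a placeholder for an embedding-based measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def audit_distractors(gold: str, distractors: list[str],
                      near_dup: float = 0.85, low_overlap: float = 0.20) -> dict:
    """Flag distractors that look accidentally correct or clinically unrelated."""
    flags = {"near_duplicate": [], "low_overlap": []}
    for option in distractors:
        s = similarity(gold, option)
        if s >= near_dup:
            flags["near_duplicate"].append(option)
        elif s <= low_overlap:
            flags["low_overlap"].append(option)
    return flags

# Hypothetical example: one distractor differs from the gold answer only in frequency.
print(audit_distractors("vancomycin 1 g IV",
                        ["vancomycin 1 g IV q12h", "linezolid 600 mg PO"]))
```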
What would settle it
A panel of clinicians reviewing a sample of distractors and rating most of them as clinically implausible would show that the benchmark does not measure real prescribing skill.
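If such a panel were convened, its agreement could be summarized with a standard inter-rater statistic. A minimal sketch of Fleiss' kappa, assuming every distractor is rated by the same number of clinicians into categories such as plausible versus implausible:

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for agreement among multiple raters.

    ratings[i][j] is the number of raters who assigned item i to category j;
    every item must be rated by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Per-item agreement
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from marginal category proportions
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)

    return (p_bar - p_e) / (1 - p_e)

# Example: three clinicians rate each distractor as plausible (col 0) or implausible (col 1).
panel = [[3, 0], [2, 1], [3, 0], [0, 3], [1, 2]]
print(f"Fleiss' kappa = {fleiss_kappa(panel):.2f}")
```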
read the original abstract
Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
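The abstract reports both F1 and Exact Match, which suggests that a question can involve more than one gold triple or that partial credit is possible; the scoring protocol is not spelled out in this excerpt. Under the assumption of set-based scoring over "medication|dose|route" keys, one plausible reading is the following sketch, which is a guess at the aggregation rather than the paper's published code:

```python
def score_question(predicted: set[str], gold: set[str]) -> tuple[bool, float]:
    """Exact match and F1 for one question, treating each option key
    (e.g. 'medication|dose|route') as an atomic item."""
    exact = predicted == gold
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return exact, f1

def benchmark_scores(results: list[tuple[set[str], set[str]]]) -> tuple[float, float]:
    """Aggregate over questions: fraction of exact matches and mean per-question F1."""
    per_q = [score_question(pred, gold) for pred, gold in results]
    em = sum(e for e, _ in per_q) / len(per_q)
    f1 = sum(f for _, f in per_q) / len(per_q)
    return em, f1
```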
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RxEval, a prescription-level benchmark for evaluating LLMs on inpatient medication recommendation. It consists of 1,547 multiple-choice questions derived from 584 patients across 18 diagnostic categories and 969 medications. Each question provides a detailed patient profile and time-ordered clinical trajectory, requiring selection of the correct medication-dose-route triple from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. Evaluation of 16 LLMs reports F1 scores ranging from 45.18 to 77.10 and a maximum exact-match score of 46.10%, with error analysis showing models often overlook stated patient information.
Significance. If the distractors prove clinically valid and the questions are free of construction artifacts, RxEval would represent a meaningful advance over existing admission-level benchmarks that rely on coarse drug codes. It could serve as a more realistic test of per-timepoint clinical reasoning in LLMs, potentially guiding improvements in medical decision support systems.
major comments (2)
- [Abstract] Abstract and benchmark construction: The headline claim that RxEval is 'challenging and discriminative' rests on the observed F1 range and exact-match scores being meaningful measures of prescribing ability. This requires that distractors are realistic but incorrect alternatives. However, the manuscript provides no description of the reasoning-chain perturbation procedure, no quantitative check that perturbed chains remain clinically coherent, and no expert review or inter-rater agreement metrics confirming that distractors neither violate contraindications nor are accidentally correct. Without this validation, the score spread could be an artifact of question design rather than model capability.
- [Evaluation] Evaluation and results sections: No statistical significance testing, confidence intervals, or effect sizes are reported for the differences in F1 and exact-match scores across the 16 models. This makes it impossible to determine whether the observed spread (45.18–77.10 F1) reflects genuine discriminative power or sampling variability.
minor comments (2)
- [Abstract] The abstract states that RxEval spans 18 diagnostic categories but does not list or characterize them, nor explain selection criteria; this detail would help readers assess coverage.
- [Benchmark Construction] The manuscript should clarify whether the 1,547 questions are unique per patient trajectory or allow multiple questions per patient, as this affects independence assumptions in the evaluation.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and will revise the manuscript to incorporate improvements where feasible.
read point-by-point responses
-
Referee: [Abstract] Abstract and benchmark construction: The headline claim that RxEval is 'challenging and discriminative' rests on the observed F1 range and exact-match scores being meaningful measures of prescribing ability. This requires that distractors are realistic but incorrect alternatives. However, the manuscript provides no description of the reasoning-chain perturbation procedure, no quantitative check that perturbed chains remain clinically coherent, and no expert review or inter-rater agreement metrics confirming that distractors neither violate contraindications nor are accidentally correct. Without this validation, the score spread could be an artifact of question design rather than model capability.
Authors: We agree that the current manuscript lacks sufficient detail on the distractor generation process. In the revised version, we will expand the benchmark construction section with a complete description of the reasoning-chain perturbation procedure, including the specific rules used to alter clinical reasoning steps while preserving patient context. We will also add quantitative checks for coherence, such as measuring semantic similarity to original prescriptions and the rate at which perturbations change clinically critical elements. We acknowledge that no formal expert review or inter-rater agreement was performed; we will explicitly state this as a limitation and explain that distractors were generated from real prescriptions with targeted perturbations designed to produce plausible but incorrect options. revision: yes
-
Referee: [Evaluation] Evaluation and results sections: No statistical significance testing, confidence intervals, or effect sizes are reported for the differences in F1 and exact-match scores across the 16 models. This makes it impossible to determine whether the observed spread (45.18–77.10 F1) reflects genuine discriminative power or sampling variability.
Authors: We agree that statistical analysis is needed to support claims of discriminativeness. In the revision, we will add appropriate statistical tests (e.g., pairwise comparisons with multiple-testing correction), 95% confidence intervals for all reported F1 and exact-match scores, and effect sizes to quantify the magnitude of differences across models. revision: yes
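One standard way to deliver this is resampling over per-question scores. The sketch below shows a percentile bootstrap interval and a paired sign-flip permutation test, offered as a generic recipe rather than the authors' planned analysis; Holm or Bonferroni correction would still be applied across the pairwise model comparisons.

```python
import random

def bootstrap_ci(per_item: list[float], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for a mean per-question score (EM or F1)."""
    rng = random.Random(seed)
    n = len(per_item)
    means = sorted(
        sum(rng.choice(per_item) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def paired_permutation_p(a: list[float], b: list[float],
                         n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the mean score difference between two models
    evaluated on the same questions (sign-flip permutation test)."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = sum(
        abs(sum(d if rng.random() < 0.5 else -d for d in diffs)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```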
Circularity Check
No significant circularity in benchmark construction or evaluation
full rationale
The paper introduces RxEval as a new multiple-choice benchmark derived from real patient trajectories and distractors generated via reasoning-chain perturbation. It then reports empirical performance of 16 external LLMs on this fixed benchmark using standard metrics (F1, Exact Match). No equations, fitted parameters, or predictions are presented; the central claims rest on observed score spreads across independent models rather than any self-referential reduction. Distractor generation is a construction step whose validity is external to the reported numbers, and no self-citation chain or uniqueness theorem is invoked to justify the results. The work is therefore self-contained as benchmark creation plus external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reasoning-chain perturbation generates clinically appropriate distractors that test real prescribing capability.