Pith · machine review for the scientific record

arXiv: 2605.14543 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Recognition: no theorem link

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation


Pith reviewed 2026-05-15 02:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM evaluation · medication recommendation · clinical benchmark · prescription prediction · healthcare AI · multiple-choice questions · patient trajectory

The pith

RxEval tests LLMs on specific medication-dose-route choices from detailed patient trajectories, revealing that even the best model reaches only 46 percent exact match.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RxEval as a benchmark that shifts medication recommendation evaluation from coarse admission-level predictions to fine-grained prescription-level decisions. Each test item supplies a full time-ordered clinical profile and asks the model to pick the correct medication, dose, and route from real prescriptions plus targeted distractors. Testing sixteen models produces F1 scores between 45 and 77 and a highest exact-match rate of 46 percent, with errors often traceable to missed patient details or incomplete clinical reasoning. If the benchmark holds, it implies that current LLMs remain unreliable for direct prescribing support without human oversight.

Core claim

RxEval consists of 1547 multiple-choice questions drawn from 584 patients across 18 diagnostic categories and 969 medications. Questions present evolving patient information and require selection of exact medication-dose-route triples, with distractors created by perturbing the reasoning chain that led to the original prescription. Model evaluations show wide performance spread yet uniformly low exact-match accuracy, indicating that frontier systems still overlook explicit patient constraints and fail to derive the correct clinical conclusion from the supplied trajectory.

What carries the argument

Prescription-level multiple-choice questions that pair real patient trajectories with distractors generated through reasoning-chain perturbation, forcing selection of precise medication-dose-route triples rather than broad drug classes.
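
To make this concrete, here is a minimal sketch of how one such item could be represented; the field names and types are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    """One answer option: an exact medication-dose-route triple."""
    medication: str  # e.g. "furosemide" (hypothetical example)
    dose: str        # e.g. "40 mg"
    route: str       # e.g. "IV"

@dataclass
class RxEvalItem:
    """Hypothetical container for one prescription-level MCQ."""
    patient_profile: str    # demographics, history, home medications
    trajectory: list[str]   # time-ordered clinical events up to the decision point
    prescription_time: str  # timestamp at which the prescribing decision is made
    options: list[Option]   # real prescriptions plus reasoning-chain-perturbed distractors
    gold: set[int]          # indices of correct options; 44.9% of items are multi-answer
```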

If this is right

  • LLMs must improve at tracking time-ordered changes in patient state to reach high accuracy on this task.
  • Exact-match rates below 50 percent imply that direct deployment in prescribing workflows would require substantial human review.
  • Error patterns centered on overlooked patient facts point to the need for better mechanisms that explicitly ground answers in the full trajectory.
  • The wide spread in F1 scores across models demonstrates that the benchmark can rank systems more finely than coarser drug-code tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If models close the gap on RxEval, the same patient-trajectory format could be adapted to evaluate other sequential clinical decisions such as lab ordering or discharge planning.
  • The benchmark's emphasis on exact dose and route suggests that future safety evaluations should test not only correctness but also the model's ability to flag unsafe alternatives among the distractors.
  • Real-world validation would require comparing model choices against actual clinician decisions rather than held-out prescriptions alone.

Load-bearing premise

The distractors created by perturbing the reasoning chain are realistic enough to stand in for the kinds of alternatives a clinician would actually consider.

What would settle it

A panel of clinicians reviewing a sample of distractors would settle it: if they rate most distractors clinically implausible, the benchmark does not measure real prescribing skill; if they rate most distractors plausible but incorrect, the load-bearing premise stands.
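
If such a panel were convened, its output would need a chance-corrected agreement statistic before the plausibility verdict could be trusted. A minimal sketch using Fleiss' kappa over binary plausibility ratings, assuming a fixed panel in which every clinician rates every sampled distractor (a protocol we are supplying, not one the paper describes):

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    # ratings[i][j] = number of raters assigning distractor i to category j
    # (e.g. j=0 plausible, j=1 implausible); each row sums to the panel size.
    n = len(ratings)     # distractors sampled
    r = sum(ratings[0])  # raters per item (assumed constant)
    k = len(ratings[0])  # rating categories
    # Mean observed per-item agreement.
    p_bar = sum((sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings) / n
    # Expected agreement from marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(k)]
    p_e = sum((t / (n * r)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical panel: 5 clinicians rate 4 distractors as plausible/implausible.
print(round(fleiss_kappa([[5, 0], [4, 1], [1, 4], [2, 3]]), 2))  # ~0.27
```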

Figures

Figures reproduced from arXiv:2605.14543 by Changmiao Wang, James T. Kwok, Shuhao Chen, Weisen Jiang, Xiaoqing Wu, Xuanren Shi, and Yu Zhang.

Figure 1: Comparison between admission-level and prescription-level medication recommendation.

Figure 2: Overview of the RxEval MCQ construction pipeline. Reasoning-chain annotation (top) extracts stepwise clinical reasoning for each correct medication and validates it through a critic that checks factual grounding and logical coherence. Distractor generation and validation (bottom) produces patient-specific distractors by perturbing the reasoning chain and verifies that each distractor is both verifiably wrong a…

Figure 3: Dataset composition of RxEval. (a) Summary statistics. (b) Diagnostic coverage across 18 ICD chapters. (c) Long-tail medication frequency for correct and distractor medications. (d) Distribution of correct options per MCQ: 44.9% are multi-answer. (e) Mean clinical events per MCQ by temporal phase, showing richer context in later stages. (f) Prompt token length distribution. The medication space contains 9…

Figure 7: Exact-match score by number of ground-truth medications (n_c). Larger answer sets are harder.

Figure 5: F1 of GPT and Gemini families on RxEval. [Plot: F1 score on TemRx (%), 55–80, for GPT-4o, Gemini-2.5-Flash, Gemini-2.5-Pro, Gemini-3-Flash, GPT-5-Mini, GPT-5, GPT-5.4, and Gemini-3.1-Pro, grouped by admission phase: Early, Middle, Late.]

Figure 8: Representative failure cases from GPT-5.4 and Gemini-3.1-Pro on…
Original abstract

Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
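
The abstract reports F1 and Exact Match over questions that can have several correct options, but does not spell out the formulas. A minimal sketch under the common convention for multi-answer MCQs, scoring each question by set overlap between selected and gold options (the paper's exact definitions may differ):

```python
def score_question(predicted: set[int], gold: set[int]) -> tuple[float, bool]:
    # Per-question F1 over option sets, plus Exact Match (sets identical).
    exact = predicted == gold
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0, exact
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall), exact

def benchmark_scores(preds: list[set[int]], golds: list[set[int]]) -> tuple[float, float]:
    rows = [score_question(p, g) for p, g in zip(preds, golds)]
    f1 = 100 * sum(f for f, _ in rows) / len(rows)
    em = 100 * sum(e for _, e in rows) / len(rows)
    return f1, em  # percentages on the same scale as the reported 45.18-77.10 F1

# Example: the second question is multi-answer and only partially recovered.
print(benchmark_scores([{0}, {1, 2}], [{0}, {1, 3}]))  # (75.0, 50.0)
```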

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RxEval, a prescription-level benchmark for evaluating LLMs on inpatient medication recommendation. It consists of 1,547 multiple-choice questions derived from 584 patients across 18 diagnostic categories and 969 medications. Each question provides a detailed patient profile and time-ordered clinical trajectory, requiring selection of the correct medication-dose-route triple from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. Evaluation of 16 LLMs reports F1 scores ranging from 45.18 to 77.10 and a maximum exact-match score of 46.10%, with error analysis showing models often overlook stated patient information.

Significance. If the distractors prove clinically valid and the questions are free of construction artifacts, RxEval would represent a meaningful advance over existing admission-level benchmarks that rely on coarse drug codes. It could serve as a more realistic test of per-timepoint clinical reasoning in LLMs, potentially guiding improvements in medical decision support systems.

major comments (2)
  1. [Abstract] Abstract and benchmark construction: The headline claim that RxEval is 'challenging and discriminative' rests on the observed F1 range and exact-match scores being meaningful measures of prescribing ability. This requires that distractors are realistic but incorrect alternatives. However, the manuscript provides no description of the reasoning-chain perturbation procedure, no quantitative check that perturbed chains remain clinically coherent, and no expert review or inter-rater agreement metrics confirming that distractors neither violate contraindications nor are accidentally correct. Without this validation, the score spread could be an artifact of question design rather than model capability.
  2. [Evaluation] Evaluation and results sections: No statistical significance testing, confidence intervals, or effect sizes are reported for the differences in F1 and exact-match scores across the 16 models. This makes it impossible to determine whether the observed spread (45.18–77.10 F1) reflects genuine discriminative power or sampling variability.
minor comments (2)
  1. [Abstract] The abstract states that RxEval spans 18 diagnostic categories but does not list or characterize them, nor explain selection criteria; this detail would help readers assess coverage.
  2. [Benchmark Construction] The manuscript should clarify whether the 1,547 questions are unique per patient trajectory or allow multiple questions per patient, as this affects independence assumptions in the evaluation.
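
One concrete instance of the validation the first major comment asks for is an automated screen for "accidentally correct" distractors: flag any distractor whose exact triple also appears among the patient's real orders. A minimal sketch under that assumption (dose normalization and contraindication checks, which require clinical resources, are omitted):

```python
Triple = tuple[str, str, str]  # (medication, dose, route), pre-lowercased

def accidentally_correct(distractors: list[Triple],
                         actual_prescriptions: list[Triple]) -> list[Triple]:
    # A distractor matching a real order is a correct answer in disguise
    # and would corrupt the gold labels if left in the option set.
    actual = set(actual_prescriptions)
    return [d for d in distractors if d in actual]

# Hypothetical data: one distractor collides with a real prescription.
flagged = accidentally_correct(
    distractors=[("metoprolol", "25 mg", "po"), ("furosemide", "40 mg", "iv")],
    actual_prescriptions=[("furosemide", "40 mg", "iv")],
)
print(flagged)  # [('furosemide', '40 mg', 'iv')] -> remove or relabel as correct
```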

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and will revise the manuscript to incorporate improvements where feasible.

Point-by-point responses
  1. Referee: [Abstract] Abstract and benchmark construction: The headline claim that RxEval is 'challenging and discriminative' rests on the observed F1 range and exact-match scores being meaningful measures of prescribing ability. This requires that distractors are realistic but incorrect alternatives. However, the manuscript provides no description of the reasoning-chain perturbation procedure, no quantitative check that perturbed chains remain clinically coherent, and no expert review or inter-rater agreement metrics confirming that distractors neither violate contraindications nor are accidentally correct. Without this validation, the score spread could be an artifact of question design rather than model capability.

    Authors: We agree that the current manuscript lacks sufficient detail on the distractor generation process. In the revised version, we will expand the benchmark construction section with a complete description of the reasoning-chain perturbation procedure, including the specific rules used to alter clinical reasoning steps while preserving patient context. We will also add quantitative checks for coherence, such as measuring semantic similarity to original prescriptions and the rate at which perturbations change clinically critical elements. We acknowledge that no formal expert review or inter-rater agreement was performed; we will explicitly state this as a limitation and explain that distractors were generated from real prescriptions with targeted perturbations designed to produce plausible but incorrect options. revision: yes

  2. Referee: [Evaluation] Evaluation and results sections: No statistical significance testing, confidence intervals, or effect sizes are reported for the differences in F1 and exact-match scores across the 16 models. This makes it impossible to determine whether the observed spread (45.18–77.10 F1) reflects genuine discriminative power or sampling variability.

    Authors: We agree that statistical analysis is needed to support claims of discriminativeness. In the revision, we will add appropriate statistical tests (e.g., pairwise comparisons with multiple-testing correction), 95% confidence intervals for all reported F1 and exact-match scores, and effect sizes to quantify the magnitude of differences across models. revision: yes
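
As one illustration of the promised analysis, a paired bootstrap over questions yields a 95% confidence interval on the F1 gap between two models. This sketch assumes per-question F1 scores are available for both models and leaves out the multiple-testing correction across all model pairs; it is one standard approach, not the authors' committed procedure.

```python
import random

def paired_bootstrap_ci(f1_a: list[float], f1_b: list[float],
                        n_boot: int = 10_000, seed: int = 0) -> tuple[float, float]:
    # Resample questions with replacement and recompute the mean per-question
    # F1 difference between model A and model B on each resample.
    rng = random.Random(seed)
    n = len(f1_a)
    diffs = sorted(
        sum(f1_a[i] - f1_b[i] for i in (rng.randrange(n) for _ in range(n))) / n
        for _ in range(n_boot)
    )
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]  # 95% CI bounds
```

If the interval excludes zero, the gap between the two models is unlikely to be sampling noise at the 5% level, which is exactly the check the referee's second comment asks for.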

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation

full rationale

The paper introduces RxEval as a new multiple-choice benchmark derived from real patient trajectories and distractors generated via reasoning-chain perturbation. It then reports empirical performance of 16 external LLMs on this fixed benchmark using standard metrics (F1, Exact Match). No equations, fitted parameters, or predictions are presented; the central claims rest on observed score spreads across independent models rather than any self-referential reduction. Distractor generation is a construction step whose validity is external to the reported numbers, and no self-citation chain or uniqueness theorem is invoked to justify the results. The work is therefore self-contained as benchmark creation plus external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the assumption that reasoning-chain perturbation produces clinically plausible distractors and that the selected patient trajectories are representative; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: reasoning-chain perturbation generates clinically appropriate distractors that test real prescribing capability.
    Central to making the multiple-choice questions discriminative rather than trivial.

pith-pipeline@v0.9.0 · 5490 in / 1237 out tokens · 36133 ms · 2026-05-15T02:04:49.937450+00:00 · methodology

