Pith · machine review for the scientific record

arXiv: 2605.14543 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Recognition: no theorem link

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation


Pith reviewed 2026-05-15 02:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM evaluation · medication recommendation · clinical benchmark · prescription prediction · healthcare AI · multiple-choice questions · patient trajectory

The pith

RxEval tests LLMs on specific medication-dose-route choices from detailed patient trajectories, revealing that even the best model reaches only 46 percent exact match.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RxEval as a benchmark that shifts medication recommendation evaluation from coarse admission-level predictions to fine-grained prescription-level decisions. Each test item supplies a full time-ordered clinical profile and asks the model to pick the correct medication, dose, and route from real prescriptions plus targeted distractors. Testing sixteen models produces F1 scores between 45 and 77 and a highest exact-match rate of 46 percent, with errors often traceable to missed patient details or incomplete clinical reasoning. If the benchmark holds, it implies that current LLMs remain unreliable for direct prescribing support without human oversight.

Core claim

RxEval consists of 1547 multiple-choice questions drawn from 584 patients across 18 diagnostic categories and 969 medications. Questions present evolving patient information and require selection of exact medication-dose-route triples, with distractors created by perturbing the reasoning chain that led to the original prescription. Model evaluations show wide performance spread yet uniformly low exact-match accuracy, indicating that frontier systems still overlook explicit patient constraints and fail to derive the correct clinical conclusion from the supplied trajectory.

What carries the argument

Prescription-level multiple-choice questions that pair real patient trajectories with distractors generated through reasoning-chain perturbation, forcing selection of precise medication-dose-route triples rather than broad drug classes.
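
To make this concrete, here is a minimal sketch of how one such item could be represented; the field names and types are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    """One answer option: an exact medication-dose-route triple."""
    medication: str  # e.g. "furosemide" (hypothetical example)
    dose: str        # e.g. "40 mg"
    route: str       # e.g. "IV"

@dataclass
class RxEvalItem:
    """Hypothetical container for one prescription-level MCQ."""
    patient_profile: str    # demographics, history, home medications
    trajectory: list[str]   # time-ordered clinical events up to the decision point
    prescription_time: str  # timestamp at which the prescribing decision is made
    options: list[Option]   # real prescriptions plus reasoning-chain-perturbed distractors
    gold: set[int]          # indices of correct options; 44.9% of items are multi-answer
```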

If this is right

  • LLMs must improve at tracking time-ordered changes in patient state to reach high accuracy on this task.
  • Exact-match rates below 50 percent imply that direct deployment in prescribing workflows would require substantial human review.
  • Error patterns centered on overlooked patient facts point to the need for better mechanisms that explicitly ground answers in the full trajectory.
  • The wide spread in F1 scores across models demonstrates that the benchmark can rank systems more finely than coarser drug-code tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If models close the gap on RxEval, the same patient-trajectory format could be adapted to evaluate other sequential clinical decisions such as lab ordering or discharge planning.
  • The benchmark's emphasis on exact dose and route suggests that future safety evaluations should test not only correctness but also the model's ability to flag unsafe alternatives among the distractors.
  • Real-world validation would require comparing model choices against actual clinician decisions rather than held-out prescriptions alone.

Load-bearing premise

The distractors created by perturbing the reasoning chain are realistic enough to stand in for the kinds of alternatives a clinician would actually consider.

What would settle it

A panel of clinicians reviewing a sample of distractors would settle it: if they rate most distractors clinically implausible, the benchmark does not measure real prescribing skill; if they rate most distractors plausible but incorrect, the load-bearing premise stands.
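
If such a panel were convened, its output would need a chance-corrected agreement statistic before the plausibility verdict could be trusted. A minimal sketch using Fleiss' kappa over binary plausibility ratings, assuming a fixed panel in which every clinician rates every sampled distractor (a protocol we are supplying, not one the paper describes):

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    # ratings[i][j] = number of raters assigning distractor i to category j
    # (e.g. j=0 plausible, j=1 implausible); each row sums to the panel size.
    n = len(ratings)     # distractors sampled
    r = sum(ratings[0])  # raters per item (assumed constant)
    k = len(ratings[0])  # rating categories
    # Mean observed per-item agreement.
    p_bar = sum((sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings) / n
    # Expected agreement from marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(k)]
    p_e = sum((t / (n * r)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical panel: 5 clinicians rate 4 distractors as plausible/implausible.
print(round(fleiss_kappa([[5, 0], [4, 1], [1, 4], [2, 3]]), 2))  # ~0.27
```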

Figures

Figures reproduced from arXiv:2605.14543 by Changmiao Wang, James T. Kwok, Shuhao Chen, Weisen Jiang, Xiaoqing Wu, Xuanren Shi, and Yu Zhang.

Figure 1: Comparison between admission-level and prescription-level medication recommendation.

Figure 2: Overview of the RxEval MCQ construction pipeline. Reasoning-chain annotation (top) extracts stepwise clinical reasoning for each correct medication and validates it through a critic that checks factual grounding and logical coherence. Distractor generation and validation (bottom) produces patient-specific distractors by perturbing the reasoning chain and verifies that each distractor is both verifiably wrong a…

Figure 3: Dataset composition of RxEval. (a) Summary statistics. (b) Diagnostic coverage across 18 ICD chapters. (c) Long-tail medication frequency for correct and distractor medications. (d) Distribution of correct options per MCQ: 44.9% are multi-answer. (e) Mean clinical events per MCQ by temporal phase, showing richer context in later stages. (f) Prompt token length distribution. The medication space contains 9…

Figure 7: Exact-match score by number of ground-truth medications (n_c). Larger answer sets are harder.

Figure 5: F1 of GPT and Gemini families on RxEval. [Plot: F1 score on TemRx (%), 55–80, for GPT-4o, Gemini-2.5-Flash, Gemini-2.5-Pro, Gemini-3-Flash, GPT-5-Mini, GPT-5, GPT-5.4, and Gemini-3.1-Pro, grouped by admission phase: Early, Middle, Late.]

Figure 8: Representative failure cases from GPT-5.4 and Gemini-3.1-Pro on…
Original abstract

Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
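
The abstract reports F1 and Exact Match over questions that can have several correct options, but does not spell out the formulas. A minimal sketch under the common convention for multi-answer MCQs, scoring each question by set overlap between selected and gold options (the paper's exact definitions may differ):

```python
def score_question(predicted: set[int], gold: set[int]) -> tuple[float, bool]:
    # Per-question F1 over option sets, plus Exact Match (sets identical).
    exact = predicted == gold
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0, exact
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall), exact

def benchmark_scores(preds: list[set[int]], golds: list[set[int]]) -> tuple[float, float]:
    rows = [score_question(p, g) for p, g in zip(preds, golds)]
    f1 = 100 * sum(f for f, _ in rows) / len(rows)
    em = 100 * sum(e for _, e in rows) / len(rows)
    return f1, em  # percentages on the same scale as the reported 45.18-77.10 F1

# Example: the second question is multi-answer and only partially recovered.
print(benchmark_scores([{0}, {1, 2}], [{0}, {1, 3}]))  # (75.0, 50.0)
```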

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RxEval, a prescription-level benchmark for evaluating LLMs on inpatient medication recommendation. It consists of 1,547 multiple-choice questions derived from 584 patients across 18 diagnostic categories and 969 medications. Each question provides a detailed patient profile and time-ordered clinical trajectory, requiring selection of the correct medication-dose-route triple from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. Evaluation of 16 LLMs reports F1 scores ranging from 45.18 to 77.10 and a maximum exact-match score of 46.10%, with error analysis showing models often overlook stated patient information.

Significance. If the distractors prove clinically valid and the questions are free of construction artifacts, RxEval would represent a meaningful advance over existing admission-level benchmarks that rely on coarse drug codes. It could serve as a more realistic test of per-timepoint clinical reasoning in LLMs, potentially guiding improvements in medical decision support systems.

major comments (2)
  1. [Abstract] Abstract and benchmark construction: The headline claim that RxEval is 'challenging and discriminative' rests on the observed F1 range and exact-match scores being meaningful measures of prescribing ability. This requires that distractors are realistic but incorrect alternatives. However, the manuscript provides no description of the reasoning-chain perturbation procedure, no quantitative check that perturbed chains remain clinically coherent, and no expert review or inter-rater agreement metrics confirming that distractors neither violate contraindications nor are accidentally correct. Without this validation, the score spread could be an artifact of question design rather than model capability.
  2. [Evaluation] Evaluation and results sections: No statistical significance testing, confidence intervals, or effect sizes are reported for the differences in F1 and exact-match scores across the 16 models. This makes it impossible to determine whether the observed spread (45.18–77.10 F1) reflects genuine discriminative power or sampling variability.
minor comments (2)
  1. [Abstract] The abstract states that RxEval spans 18 diagnostic categories but does not list or characterize them, nor explain selection criteria; this detail would help readers assess coverage.
  2. [Benchmark Construction] The manuscript should clarify whether the 1,547 questions are unique per patient trajectory or allow multiple questions per patient, as this affects independence assumptions in the evaluation.
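
One concrete instance of the validation the first major comment asks for is an automated screen for "accidentally correct" distractors: flag any distractor whose exact triple also appears among the patient's real orders. A minimal sketch under that assumption (dose normalization and contraindication checks, which require clinical resources, are omitted):

```python
Triple = tuple[str, str, str]  # (medication, dose, route), pre-lowercased

def accidentally_correct(distractors: list[Triple],
                         actual_prescriptions: list[Triple]) -> list[Triple]:
    # A distractor matching a real order is a correct answer in disguise
    # and would corrupt the gold labels if left in the option set.
    actual = set(actual_prescriptions)
    return [d for d in distractors if d in actual]

# Hypothetical data: one distractor collides with a real prescription.
flagged = accidentally_correct(
    distractors=[("metoprolol", "25 mg", "po"), ("furosemide", "40 mg", "iv")],
    actual_prescriptions=[("furosemide", "40 mg", "iv")],
)
print(flagged)  # [('furosemide', '40 mg', 'iv')] -> remove or relabel as correct
```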

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and will revise the manuscript to incorporate improvements where feasible.

Point-by-point responses
  1. Referee: [Abstract] Abstract and benchmark construction: The headline claim that RxEval is 'challenging and discriminative' rests on the observed F1 range and exact-match scores being meaningful measures of prescribing ability. This requires that distractors are realistic but incorrect alternatives. However, the manuscript provides no description of the reasoning-chain perturbation procedure, no quantitative check that perturbed chains remain clinically coherent, and no expert review or inter-rater agreement metrics confirming that distractors neither violate contraindications nor are accidentally correct. Without this validation, the score spread could be an artifact of question design rather than model capability.

    Authors: We agree that the current manuscript lacks sufficient detail on the distractor generation process. In the revised version, we will expand the benchmark construction section with a complete description of the reasoning-chain perturbation procedure, including the specific rules used to alter clinical reasoning steps while preserving patient context. We will also add quantitative checks for coherence, such as measuring semantic similarity to original prescriptions and the rate at which perturbations change clinically critical elements. We acknowledge that no formal expert review or inter-rater agreement was performed; we will explicitly state this as a limitation and explain that distractors were generated from real prescriptions with targeted perturbations designed to produce plausible but incorrect options. revision: yes

  2. Referee: [Evaluation] Evaluation and results sections: No statistical significance testing, confidence intervals, or effect sizes are reported for the differences in F1 and exact-match scores across the 16 models. This makes it impossible to determine whether the observed spread (45.18–77.10 F1) reflects genuine discriminative power or sampling variability.

    Authors: We agree that statistical analysis is needed to support claims of discriminativeness. In the revision, we will add appropriate statistical tests (e.g., pairwise comparisons with multiple-testing correction), 95% confidence intervals for all reported F1 and exact-match scores, and effect sizes to quantify the magnitude of differences across models. revision: yes
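
As one illustration of the promised analysis, a paired bootstrap over questions yields a 95% confidence interval on the F1 gap between two models. This sketch assumes per-question F1 scores are available for both models and leaves out the multiple-testing correction across all model pairs; it is one standard approach, not the authors' committed procedure.

```python
import random

def paired_bootstrap_ci(f1_a: list[float], f1_b: list[float],
                        n_boot: int = 10_000, seed: int = 0) -> tuple[float, float]:
    # Resample questions with replacement and recompute the mean per-question
    # F1 difference between model A and model B on each resample.
    rng = random.Random(seed)
    n = len(f1_a)
    diffs = sorted(
        sum(f1_a[i] - f1_b[i] for i in (rng.randrange(n) for _ in range(n))) / n
        for _ in range(n_boot)
    )
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]  # 95% CI bounds
```

If the interval excludes zero, the gap between the two models is unlikely to be sampling noise at the 5% level, which is exactly the check the referee's second comment asks for.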

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation

full rationale

The paper introduces RxEval as a new multiple-choice benchmark derived from real patient trajectories and distractors generated via reasoning-chain perturbation. It then reports empirical performance of 16 external LLMs on this fixed benchmark using standard metrics (F1, Exact Match). No equations, fitted parameters, or predictions are presented; the central claims rest on observed score spreads across independent models rather than any self-referential reduction. Distractor generation is a construction step whose validity is external to the reported numbers, and no self-citation chain or uniqueness theorem is invoked to justify the results. The work is therefore self-contained as benchmark creation plus external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the assumption that reasoning-chain perturbation produces clinically plausible distractors and that the selected patient trajectories are representative; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: reasoning-chain perturbation generates clinically appropriate distractors that test real prescribing capability.
    Central to making the multiple-choice questions discriminative rather than trivial.

pith-pipeline@v0.9.0 · 5490 in / 1237 out tokens · 36133 ms · 2026-05-15T02:04:49.937450+00:00 · methodology

