Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Alex J. Goodell; Daniel Tawfik; Junze Ye; Mark K. Buyyounouski; Mohsen Bayati; Nikhil V. Kotha

arxiv: 2512.19691 · v3 · submitted 2025-12-22 · 💻 cs.AI · stat.AP

Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati

Pith reviewed 2026-05-16 20:19 UTC · model grok-4.3

classification 💻 cs.AI stat.AP

keywords LLM benchmarks · clinical AI · label errors · physician validation · MedCalc-Bench · benchmark auditing · reinforcement learning · stewardship pipeline

The pith

At least 27% of test labels in an LLM-assisted clinical benchmark are likely erroneous or incomputable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits MedCalc-Bench, a clinical benchmark for medical score computation whose labels were partly derived with LLM assistance. It develops a scalable physician-in-the-loop stewardship pipeline and determines that at least 27% of test labels are likely erroneous or incomputable. On a 50-instance physician-validated subset, recomputed labels agree with physician ground truth 74% of the time versus 20% for the originals. Using the original labels underestimates frontier LLM accuracy by 16-23 percentage points, while training on recomputed labels improves performance by 13.5 points on physician-labeled instances with gains that extend to related tasks. This demonstrates that LLM-assisted benchmarks require active oversight to avoid propagating systematic errors into evaluation and post-training.

Core claim

The central claim is that LLM-assisted reference labels in MedCalc-Bench contain substantial errors, with at least 27% of the test set likely erroneous or incomputable from the inputs. A physician-in-the-loop pipeline recomputes the labels and achieves 74% agreement with physician ground truth on a validated 50-instance subset, compared with 20% agreement for the original labels. Evaluation of frontier LLMs on the original labels underestimates accuracy by 16-23 percentage points. A controlled reinforcement-learning experiment shows that a model trained on the recomputed labels outperforms one trained on the originals by 13.5 percentage points on physician-labeled instances, and the benefit extends to related medical tasks.

What carries the argument

The scalable physician-in-the-loop stewardship pipeline that recomputes benchmark labels and validates them against physician judgments.

Load-bearing premise

That the 50-instance physician-validated subset is representative of the full test set, and that physician labels constitute reliable ground truth without further adjudication.

What would settle it

Re-validating a substantially larger random sample of the full test set with multiple physicians to check whether the 27% error rate and the 74% versus 20% agreement rates hold.

Figures

Figures reproduced from arXiv: 2512.19691 by Alex J. Goodell, Daniel Tawfik, Junze Ye, Mark K. Buyyounouski, Mohsen Bayati, Nikhil V. Kotha.

**Figure 1.** Our benchmark stewardship utilizes two distinct LLM agent workflows (Phases 1 & 2) to assure MedCalc-Bench’s label quality. Besides their prompt and output type (Yes/No verdict versus a recomputed label), the two workflows differ in what the LLM agent is given as context: the Phase 1 agent is shown the original reference label and its derivation metadata, whereas the Phase 2 agent is only provided the pati…

**Figure 2.** Representative error types. (a) Feature extraction error: GPT-4 might have confused “hemoglobin” with “albumin”, extracting a value that is physiologically impossible; (b) Incorrect aggregation logic: incorrect Python code for Glasgow Coma Scale aggregation that double-counts a feature value, inflating ŷ_original; (c) q is not answerable given C: a Sodium correction for hyperglycemia inappropriately ap…

**Figure 3.** Label Instantiation Changes Alignment Interpretation. Test accuracy dynamics for Qwen3-8B trained via GRPO using the recomputed reference labels (green) versus the original MedCalc-Bench labels (grey). Shaded bands indicate ±1σ smoothed over a 10-step window. Improving the reward signal’s factual grounding effects a +8.7% absolute gain in the final moving averages of model test accuracy (71.4% vs. 62.6%).…

**Figure 4.** Prompts for the controlled RL experiment in §2.6, which are identically used across the two comparison groups. These prompts are fed as context into a Qwen3-8B model checkpoint that parametrizes the RL policy.

**Figure 5.** Prompts for the LLM audit agent executing the Phase 1 workflow (§2.3).

**Figure 6.** Prompts for the LLM relabeling agent executing the Phase 2 workflow (§2.4). Note that unlike…

**Figure 7.** Benchmark performance of four Claude-series models was reported by Anthropic in Exhibit #1 of their Jan 11, 2026 press release [21]. Screenshot taken on Jan 17, 2026. We configured our API call pipeline to match the setup described in the first two bullet-point footnotes. MedAgentBench [48] is unrelated to our study.

**Figure 8.** MedCalc-Bench accuracy comparison between Anthropic’s official results (shown in…

**Figure 9.** API call prompt template, applied identically to both treatment groups in §G.1.
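To make the evaluation setup concrete, here is a minimal sketch of how one set of model completions can be scored against two different label sets. Only the `<answer></answer>` tag convention comes from the prompt template quoted in this page's extracted text. The function names, the 5% relative tolerance, and the toy completions and labels are illustrative assumptions, not the authors' implementation or data.

```python
import re
from typing import Optional, Sequence

# Matches the content of an <answer>...</answer> tag, ignoring surrounding whitespace.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def extract_answer(completion: str) -> Optional[float]:
    """Return the last tagged answer as a float, or None if absent or non-numeric."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except ValueError:
        return None


def is_correct(pred: Optional[float], label: float, rel_tol: float = 0.05) -> bool:
    """Assumed scoring rule: within rel_tol (relative) of the reference label."""
    if pred is None:
        return False
    return abs(pred - label) <= rel_tol * abs(label)


def accuracy(completions: Sequence[str], labels: Sequence[float],
             rel_tol: float = 0.05) -> float:
    """Fraction of completions whose tagged answer matches the given labels."""
    hits = sum(is_correct(extract_answer(c), y, rel_tol)
               for c, y in zip(completions, labels))
    return hits / len(labels)


# Hypothetical toy data: the second original label is wrong, the recomputed one is fixed.
completions = [
    "Step-by-step work... <answer>42.5</answer>",
    "<answer>7</answer>",
    "The model forgot the tags.",
]
original_labels = [42.5, 9.0, 3.0]
recomputed_labels = [42.5, 7.0, 3.0]

acc_original = accuracy(completions, original_labels)
acc_recomputed = accuracy(completions, recomputed_labels)
print(f"original: {acc_original:.2f}, recomputed: {acc_recomputed:.2f}")
```

In this toy case the same completions score lower against the flawed label set, which is the mechanism by which erroneous reference labels would understate model accuracy.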

read the original abstract

Reference labels for machine-learning benchmarks are increasingly synthesized with LLM assistance, but their reliability remains underexamined. We audit MedCalc-Bench, a clinical benchmark for medical score computation whose labels were partly derived with LLM assistance, and develop a scalable physician-in-the-loop stewardship pipeline to reassess them. At least 27% of test labels are likely erroneous or incomputable. On a 50-instance subset validated by physicians, our recomputed labels agree with physician ground truth 74% of the time (95% CI, 60-84%) versus 20% for the originals (95% CI, 11-33%). Using original labels to evaluate frontier LLMs underestimates accuracy by 16-23 percentage points. In a controlled reinforcement-learning experiment, a model trained on recomputed labels outperforms one trained on originals by 13.5 percentage points (95% CI, 10.6-16.6%) on physician-labeled instances, and this advantage extends to related medical tasks. LLM-assisted benchmarks can propagate systematic errors into both evaluation and post-training unless actively stewarded.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper audits MedCalc-Bench, a clinical benchmark whose labels were partly LLM-assisted, and introduces a physician-in-the-loop stewardship pipeline. It reports that at least 27% of test labels are erroneous or incomputable; on a 50-instance physician-validated subset, recomputed labels agree with physician ground truth 74% of the time (95% CI 60-84%) versus 20% (95% CI 11-33%) for the originals. This leads to 16-23pp underestimation of frontier LLM accuracy, and a controlled RL experiment shows models trained on recomputed labels outperform those trained on originals by 13.5pp (95% CI 10.6-16.6%) on physician-labeled instances, with the advantage extending to related tasks.

Significance. If the central findings hold, the work is significant for AI-assisted clinical benchmarks because it supplies direct, quantitative evidence (agreement rates with CIs and a controlled RL delta) that unverified LLM labels can propagate systematic errors into both evaluation and post-training. The physician-in-the-loop pipeline is presented as scalable and reproducible, offering a concrete template for stewardship that could raise standards in medical ML benchmarks.

major comments (3)

[Physician-validated subset (results section describing the 50-instance audit)] The selection process for the 50-instance physician-validated subset is unspecified (random sampling? stratification? criteria favoring recomputable cases?). This is load-bearing for the headline claim of at least 27% erroneous labels on the full test set, because the extrapolation and the reported 74% vs 20% gap both rest on the subset being representative.
[Physician validation procedure (methods and results on ground-truth construction)] Physician labels are used as ground truth for the agreement rates and RL evaluation without any inter-rater reliability metric (e.g., pairwise agreement or kappa). Single-rater variability could inflate or deflate the 74% vs 20% difference and the 13.5pp RL advantage, both of which are central to the stewardship argument.
[Results on label error rate (the paragraph reporting the 27% statistic)] The 27% figure on the full test set is stated as 'likely erroneous or incomputable' but the exact derivation (direct count, model-based inference, or subset extrapolation) is not detailed enough to assess whether it inherits the same selection bias as the 50-instance sample.

minor comments (2)

[Abstract and results tables] The 95% CIs are reported but the exact statistical method (binomial, bootstrap, etc.) and software used should be stated explicitly for reproducibility.
[RL experiment subsection] The RL experiment description would benefit from a brief diagram or pseudocode showing how the recomputed vs original label sets were used for training and how the physician-labeled test instances were held out.
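As an illustration of the first minor comment, the sketch below computes a Wilson score interval for a binomial proportion. It assumes the agreement counts are 37/50 and 10/50 (inferred here from the reported 74% and 20% on 50 instances) and that a Wilson interval is one plausible choice; the page does not state which method the authors actually used.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.959964 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 74% and 20% of 50 physician-validated instances.
for name, k in [("recomputed", 37), ("original", 10)]:
    lo, hi = wilson_interval(k, 50)
    print(f"{name}: {k}/50 -> {100 * lo:.0f}-{100 * hi:.0f}%")
```

Under these assumptions the intervals round to 60-84% and 11-33%, which is consistent with the ranges quoted in the abstract, though that agreement alone does not confirm the authors' method.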

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us strengthen the manuscript. We address each major point below with clarifications and revisions where needed. The physician-in-the-loop pipeline remains a core contribution, and we have improved transparency around sampling, validation, and error-rate derivation without altering the central findings.

read point-by-point responses

Referee: [Physician-validated subset (results section describing the 50-instance audit)] The selection process for the 50-instance physician-validated subset is unspecified (random sampling? stratification? criteria favoring recomputable cases?). This is load-bearing for the headline claim of at least 27% erroneous labels on the full test set, because the extrapolation and the reported 74% vs 20% gap both rest on the subset being representative.

Authors: The 50-instance subset was drawn via simple random sampling from the full test set using a fixed seed for reproducibility; no stratification or selection favoring recomputable cases was applied. We have revised the Methods section to state this explicitly, including the sampling procedure and seed, so that readers can directly assess representativeness. This clarification supports the reported agreement rates and the extrapolation to the 27% figure on the full set. revision: yes
Referee: [Physician validation procedure (methods and results on ground-truth construction)] Physician labels are used as ground truth for the agreement rates and RL evaluation without any inter-rater reliability metric (e.g., pairwise agreement or kappa). Single-rater variability could inflate or deflate the 74% vs 20% difference and the 13.5pp RL advantage, both of which are central to the stewardship argument.

Authors: We acknowledge that the use of a single board-certified physician as ground truth is a limitation, as inter-rater reliability was not measured. The physician followed a standardized protocol based on published clinical guidelines and was blinded to original labels. We have added an explicit limitations paragraph discussing single-rater variability and its potential impact on the observed gaps, while noting that the magnitude of the 74% vs 20% difference makes systematic bias unlikely to reverse the direction of the findings. Future extensions could include multiple raters. revision: partial
Referee: [Results on label error rate (the paragraph reporting the 27% statistic)] The 27% figure on the full test set is stated as 'likely erroneous or incomputable' but the exact derivation (direct count, model-based inference, or subset extrapolation) is not detailed enough to assess whether it inherits the same selection bias as the 50-instance sample.

Authors: The 27% figure is obtained by direct enumeration over the entire test set: we flagged every label that was either incomputable (missing required inputs) or violated the original MedCalc computation rules. It is not an extrapolation from the 50-instance subset and therefore does not inherit its sampling properties. We have expanded the Results section with the precise counting procedure, a breakdown by error type, and a statement confirming the full-set scope. revision: yes
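For the inter-rater reliability point raised in the second comment, a minimal sketch of Cohen's kappa for two raters follows. The ratings are hypothetical binary verdicts (1 = label judged correct, 0 = judged incorrect) invented for illustration; no second-rater data is reported on this page.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two physicians on ten instances.
physician_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
physician_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
kappa = cohen_kappa(physician_1, physician_2)
print(f"kappa = {kappa:.3f}")
```

Here raw agreement is 8/10, but kappa is lower because part of that agreement is expected by chance given each rater's marginal frequencies.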

Circularity Check

0 steps flagged

No significant circularity; claims rest on external physician validation

full rationale

The paper's central results (≥27% erroneous labels, 74% vs 20% agreement, 13.5pp RL gain) are obtained by direct recomputation and comparison against independent physician-provided ground truth on a 50-instance subset. These are empirical measurements against an external reference rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or ansatzes are invoked that collapse the output to the input by construction. The derivation chain is therefore self-contained against the physician-labeled instances.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that physician labels serve as ground truth and that the 50-instance sample is representative; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Physician-provided labels constitute reliable ground truth for the audited instances.
Invoked when reporting agreement rates between recomputed labels and physician judgments.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 5 internal anchors

[1]

Estimating the attributable cost of physician burnout in the United States

Han S, Shanafelt TD, Sinsky CA, et al. Estimating the attributable cost of physician burnout in the United States. Ann Intern Med 2019;170:784–90. doi:10.7326/M18-1422
[2]

Health Care Expenditures Attributable to Primary Care Physician Overall and Burnout-Related Turnover: A Cross-sectional Analysis

Sinsky CA, Shanafelt TD, Dyrbye LN, Sabety AH, Carlasare LE, and West CP. Health Care Expenditures Attributable to Primary Care Physician Overall and Burnout-Related Turnover: A Cross-sectional Analysis. Mayo Clin Proc 2022;97:693–702. doi:10.1016/j.mayocp.2021.09.013
[3]

Evaluating the Costs of Nurse Burnout-Attributed Turnover: A Markov Modeling Approach

Muir KJ, Wanchek TN, Lobo JM, and Keim-Malpass J. Evaluating the Costs of Nurse Burnout-Attributed Turnover: A Markov Modeling Approach. J Patient Saf 2022;18:351–7. doi:10.1097/PTS.0000000000000920
[4]

Predicting Primary Care Physician Burnout From Electronic Health Record Use Measures

Tawfik D, Bayati M, Liu J, et al. Predicting Primary Care Physician Burnout From Electronic Health Record Use Measures. Mayo Clin Proc 2024;99:1411–21. doi:10.1016/j.mayocp.2024.01.005
[5]

Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation

Tierney AA, Gayre G, Hoberman B, et al. Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation. NEJM Catal Innov Care Deliv 2024;5:0404. doi:10.1056/CAT.23.0404
[6]

A pragmatic randomized controlled trial of ambient artificial intelligence to improve health practitioner well-being

Afshar M, Ryan Baumann M, Resnik F, et al. A pragmatic randomized controlled trial of ambient artificial intelligence to improve health practitioner well-being. NEJM AI 2025;2:AIoa2500945. doi:10.1056/AIoa2500945
[7]

Large language models for preventing medication direction errors in online pharmacies

Pais C, Liu J, Voigt R, Gupta V, Wade E, and Bayati M. Large language models for preventing medication direction errors in online pharmacies. Nat Med 2024;30:1574–82. doi:10.1038/s41591-024-02933-8

[8]

AI-based Clinical Decision Support for Primary Care: A Real-World Study

Korom R, Kiptinness S, Adan N, et al. AI-based Clinical Decision Support for Primary Care: A Real-World Study. arXiv preprint 2025. doi:10.48550/arXiv.2507.16947
[9]

Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records

Chung P, Swaminathan A, Goodell AJ, et al. Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records. NEJM AI 2025;3:AIdbp2500418. doi:10.1056/AIdbp2500418
[10]

Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community

Van Walraven C, Dhalla IA, Bell C, et al. Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. Can Med Assoc J 2010;182:551–7. doi:10.1503/cmaj.091117
[11]

Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study

Lim W, Eerden M van der, Laing R, et al. Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study. Thorax 2003;58:377–82. doi:10.1136/thorax.58.5.377
[12]

Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the Euro Heart Survey on atrial fibrillation

Lip GY, Nieuwlaat R, Pisters R, Lane DA, and Crijns HJ. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the Euro Heart Survey on atrial fibrillation. Chest 2010;137:263–72. doi:10.1378/chest.09-1584
[13]

MDCalc. Frequently Asked Questions. 2024. url:https://web.archive.org/web/20240405155011/https://www.mdcalc.com/faq
[14]

MedCalc-Bench: Evaluating Large Language Models for Medical Calculations

Khandekar N, Jin Q, Xiong G, et al. MedCalc-Bench: Evaluating Large Language Models for Medical Calculations. In: Advances in Neural Information Processing Systems. Vol. 37. (NeurIPS 2024 Datasets and Benchmark Track Oral). 2024:84730–45. doi:10.52202/079017-2690. url:https://proceedings.neurips.cc/paper_files/paper/2024/file/99e81750f3fdfcaf9613d...
[15]

MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Bedi S, Cui H, Fuentes M, et al. MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks. arXiv preprint 2025. doi:10.48550/arXiv.2505.23802
[16]

RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Prediction

Liu F, Wu J, Zhou H, et al. RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Prediction. medRxiv preprint 2025. doi:10.1101/2025.04.03.25323489
[17]

Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning

Lin J, Wu Z, and Sun J. Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning. arXiv preprint 2025. doi:10.48550/arXiv.2505.24105
[18]

Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning

Zhang X, Wang Y, Feng Z, et al. Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning. arXiv preprint 2025. doi:10.48550/arXiv.2506.12307
[19]

From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations

Wang B, Xia I, Zhang Y, et al. From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, 2025:10820–44. doi:10.18653/v1/2025.emnlp-main.548
[20]

Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs

Oh G, Kim S, Park S, and Kim BH. Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs. arXiv preprint 2025. doi:10.48550/arXiv.2506.13102
[21]

Advancing Claude in healthcare and the life sciences

Anthropic. Advancing Claude in healthcare and the life sciences. Anthropic (News / Announcements). Archived by the Internet Archive Wayback Machine. See Figure 1 of their press release. 2026. url:https://web.archive.org/web/20260112115216/https://www.anthropic.com/news/healthcare-life-sciences
[22]

The Briefing: Healthcare and Life Sciences

Anthropic. The Briefing: Healthcare and Life Sciences. YouTube video (livestream / on-demand recording). At 23:09-24:16, the presenter discusses Claude model series’ performance on MedCalc-Bench and MedAgentBench. 2026. url:https://www.youtube.com/watch?v=UXyVMGAFLAs (visited on 01/17/2026)
[23]

Large language model agents can use tools to perform clinical calculations

Goodell AJ, Chu SN, Rouholiman D, and Chu LF. Large language model agents can use tools to perform clinical calculations. npj Digit Med 2025;8:163. doi:10.1038/s41746-025-01475-8
[24]

Development of an LLM Pipeline Surpassing Physicians in Cardiovascular Risk Score Calculation

Roeschl T, Hoffmann M, Unbehaun A, et al. Development of an LLM Pipeline Surpassing Physicians in Cardiovascular Risk Score Calculation. medRxiv 2025:2025–11. doi:10.1101/2025.11.11.25340002
[25]

Training language models to follow instructions with human feedback

Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. In: Advances in neural information processing systems. Vol. 35. 2022:27730–44
[26]

Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs

Ahmadian A, Cremer C, Gallé M, et al. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024:12248–67
[27]

Experiment with parameter values | Generative AI on Vertex AI

Google Cloud. Experiment with parameter values | Generative AI on Vertex AI. Accessed: 2025-12-30. 2025. url:https://docs.cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/adjust-parameter-values
[28]

Long-Range Forecasting: From Crystal Ball to Computer

Armstrong JS. Long-Range Forecasting: From Crystal Ball to Computer. 2nd ed. New York: John Wiley & Sons, 1985. url:https://ssrn.com/abstract=666990
[29]

Qwen3 Technical Report

Qwen. Qwen3 Technical Report. 2025. arXiv:2505.09388 [cs.CL]. url:https://web.archive.org/web/20250829140319/https://huggingface.co/Qwen/Qwen3-8B
[30]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao Z, Wang P, Zhu Q, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 2024

[31]

Calculating Total Daily Dose of Opioids for Safer Dosage

CDC. Calculating Total Daily Dose of Opioids for Safer Dosage. U.S. Department of Health and Human Services. Archived from the original on November 25, 2022. 2016. url:https://web.archive.org/web/20221125005318/https://www.cdc.gov/drugoverdose/pdf/calculating_total_daily_dose-a.pdf
[32]

CDC Clinical Practice Guideline for Prescribing Opioids for Pain — United States, 2022

Dowell D, Ragan KR, Jones CM, Baldwin GT, and Chou R. CDC Clinical Practice Guideline for Prescribing Opioids for Pain — United States, 2022. 2022. doi:10.15585/mmwr.rr7103a1. url:https://www.cdc.gov/mmwr/volumes/71/rr/rr7103a1.htm
[33]

PDMP Morphine Milligram Equivalents Fact Sheet

Maryland Department of Health. PDMP Morphine Milligram Equivalents Fact Sheet. Prescription Drug Monitoring Program. Archived from the original on August 31, 2025. 2025. url:https://web.archive.org/web/20250831211406/https://health.maryland.gov/pdmp/Documents/Clinical%20Docs/MME%20Fact%20Sheet.pdf
[34]

Opioid Oral Morphine Milligram Equivalent (MME) Conversion Factors

Utah Department of Health and Human Services. Opioid Oral Morphine Milligram Equivalent (MME) Conversion Factors. Utah Medicaid. Archived from the original on August 1, 2025. 2025. url:https://web.archive.org/web/20250801184557/https://medicaid-documents.dhhs.utah.gov/Documents/files/Opioid-Morphine-EQ-Conversion-Factors.pdf
[35]

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Gerstgrasser M, Schaeffer R, Dey A, et al. Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. In: First Conference on Language Modeling. 2024. url:https://openreview.net/forum?id=5B2K4LRgmz
[36]

AI models collapse when trained on recursively generated data

Shumailov I, Shumaylov Z, Zhao Y, Papernot N, Anderson R, and Gal Y. AI models collapse when trained on recursively generated data. Nature 2024;631:755–9. doi:10.1038/s41586-024-07566-y
[37]

A preliminary study of o1 in medicine: Are we closer to an ai doctor?

Xie Y, Wu J, Tu H, et al. A preliminary study of o1 in medicine: Are we closer to an ai doctor? arXiv preprint arXiv:2409.15277 2024

[38]

General scales unlock ai evaluation with explanatory and predictive power

Zhou L, Pacchiardi L, Martínez-Plumed F, et al. General scales unlock ai evaluation with explanatory and predictive power. arXiv preprint arXiv:2503.06378 2025

[39]

Large Language Model Agents for Biomedicine: A Comprehensive Review of Methods, Evaluations, Challenges, and Future Directions

Xu X and Sankar R. Large Language Model Agents for Biomedicine: A Comprehensive Review of Methods, Evaluations, Challenges, and Future Directions. Information 2025;16:894
[40]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 1992;8:229–56
[41]

Policy gradient methods for reinforcement learning with function approximation

Sutton RS, McAllester D, Singh S, and Mansour Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems 1999;12
[42]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Schulman J, Moritz P, Levine S, Jordan M, and Abbeel P. High-dimensional continuous control using generalized advantage estimation. In: International Conference on Learning Representations (ICLR). 2016. url:https://arxiv.org/abs/1506.02438
[43]

Buy 4 REINFORCE Samples, Get a Baseline for Free!

Kool W, Hoof H van, and Welling M. Buy 4 REINFORCE Samples, Get a Baseline for Free! ICLR 2019 Workshop drlStructPred. 2019. url:https://openreview.net/forum?id=r1lgTGL5DE
[44]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo D, Yang D, Zhang H, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 2025

[45]

Training Language Models to Self-Correct via Reinforcement Learning

Kumar A, Zhuang V, Agarwal R, et al. Training Language Models to Self-Correct via Reinforcement Learning. In: The Thirteenth International Conference on Learning Representations. 2025. url:https://openreview.net/forum?id=CjwERcAU7w
[46]

Understanding R1-Zero-Like Training: A Critical Perspective

Liu Z, Chen C, Li W, et al. Understanding r1-zero-like training: A critical perspective. arXiv preprint 2025. doi:10.48550/arXiv.2503.20783
[47]

HybridFlow: A Flexible and Efficient RLHF Framework

Sheng G, Zhang C, Ye Z, et al. HybridFlow: A Flexible and Efficient RLHF Framework. In: Proceedings of the Twentieth European Conference on Computer Systems. 2025:1279–97. doi:10.1145/3689031.3696075
[48]

MedAgentBench: a virtual EHR environment to benchmark medical LLM agents

Jiang Y, Black KC, Geng G, et al. MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. NEJM AI 2025;2:AIdbp2500144. doi:10.1056/AIdbp2500144
[49]

MedCalc-Bench-Verified

reexamined MedCalc-Bench by introducing an LLM-judged, stepwise evaluation pipeline to grade an LLM’s intermediate calculation steps at test time, which can be thought of as providing process-level rewards besides a binary final answer correctness. Although they recommended to remove 10.3% of the original test instances based on a clinician review, they o...
[50]

72M admitted to ED for CHF exacerbation. Hospital stay was 4 days. Hx of diabetes (CCI=2). Visited the ED twice in last 5 months

real numbers, e.g. blood pressure, Likert scale response;

Table 2: Mapping Formalism to a Concrete Example: The LACE Score [10]

| Symbol | Definition | Example (LACE Score for Readmission) |
| --- | --- | --- |
| C | Patient context (EHR notes and lab data) | “72M admitted to ED for CHF exacerbation. Hospital stay was 4 days. Hx of diabetes (CCI=2). Visited the ED twice... |
[51]

categorical values, ranging from True/False to symptom groups; we call this a set C
[52]

In sum, $x_i \in X_i := \mathbb{R} \cup C \cup \{\mathrm{N/A}\}$

a fallback undefined token denoted by N/A, indicating unextractability of a feature. In sum, $x_i \in X_i := \mathbb{R} \cup C \cup \{\mathrm{N/A}\}$. Let $X \triangleq \prod_{i=1}^{m} X_i$ denote the feature space, a Cartesian product of individual per-feature domains. Lastly, for each $(C, q)$ let $y$ be any answer predicted by an algorithm, and $y^*$ be the (latent) clinically correct ground truth answer; $y^*$ is a scalar if the quest...
[53]

Therearealtogether 10,053 + 1,047 = 11,100 notes in the train and test sets

Patient notes are real, deidentified patient cases they scraped from journal-published medicalpapersthatarearchivedinthePubMedCentraldatabase. Therearealtogether 10,053 + 1,047 = 11,100 notes in the train and test sets. We note that each of these patient notes can be seen as a unique contextCin the notation of §D.1

work page
[54]

What is the patient’s Glasgow Coma Score?

The medical score questions (q’s), e.g. “What is the patient’s Glasgow Coma Score?”, are sourced as the 55 unique calculators listed as “popular” on MDCalc.com, a web app widely used by U.S. physicians. One can think of each train or test instance in the MedCalc-Bench dataset as uniquely identified by a pair(C,q). Many instances may share the same score q...

work page 2024
[55]

Read the patient note carefully and extract all relevant clinical values

work page
[56]

Identify the appropriate medical formula or calculation method needed

work page
[57]

Show your work step by step

Use Python code to perform the calculation accurately. Show your work step by step

work page
[58]

After completing your calculation, provide your final numerical answer

work page
[59]

Your final answer MUST be enclosed in <answer></answer> tags. For example, if your calculated answer is 42.5, you would write: <answer>42.5</answer> Now solve the problem: Figure 9:API call prompt template, applied identically to both treatment groups in §G.1. 46

work page

[1] Han S, Shanafelt TD, Sinsky CA, et al. Estimating the attributable cost of physician burnout in the United States. Ann Intern Med 2019;170:784–90. doi:10.7326/M18-1422

[2] Sinsky CA, Shanafelt TD, Dyrbye LN, Sabety AH, Carlasare LE, and West CP. Health Care Expenditures Attributable to Primary Care Physician Overall and Burnout-Related Turnover: A Cross-sectional Analysis. Mayo Clin Proc 2022;97:693–702. doi:10.1016/j.mayocp.2021.09.013

[3] Muir KJ, Wanchek TN, Lobo JM, and Keim-Malpass J. Evaluating the Costs of Nurse Burnout-Attributed Turnover: A Markov Modeling Approach. J Patient Saf 2022;18:351–7. doi:10.1097/PTS.0000000000000920

[4] Tawfik D, Bayati M, Liu J, et al. Predicting Primary Care Physician Burnout From Electronic Health Record Use Measures. Mayo Clin Proc 2024;99:1411–21. doi:10.1016/j.mayocp.2024.01.005

[5] Tierney AA, Gayre G, Hoberman B, et al. Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation. NEJM Catal Innov Care Deliv 2024;5:0404. doi:10.1056/CAT.23.0404

[6] Afshar M, Ryan Baumann M, Resnik F, et al. A pragmatic randomized controlled trial of ambient artificial intelligence to improve health practitioner well-being. NEJM AI 2025;2:AIoa2500945. doi:10.1056/AIoa2500945

[7] Pais C, Liu J, Voigt R, Gupta V, Wade E, and Bayati M. Large language models for preventing medication direction errors in online pharmacies. Nat Med 2024;30:1574–82. doi:10.1038/s41591-024-02933-8

[8] Korom R, Kiptinness S, Adan N, et al. AI-based Clinical Decision Support for Primary Care: A Real-World Study. arXiv preprint 2025. doi:10.48550/arXiv.2507.16947

[9] Chung P, Swaminathan A, Goodell AJ, et al. Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records. NEJM AI 2025;3:AIdbp2500418. doi:10.1056/AIdbp2500418

[10] Van Walraven C, Dhalla IA, Bell C, et al. Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. Can Med Assoc J 2010;182:551–7. doi:10.1503/cmaj.091117

[11] Lim W, Eerden M van der, Laing R, et al. Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study. Thorax 2003;58:377–82. doi:10.1136/thorax.58.5.377

[12] Lip GY, Nieuwlaat R, Pisters R, Lane DA, and Crijns HJ. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the Euro Heart Survey on atrial fibrillation. Chest 2010;137:263–72. doi:10.1378/chest.09-1584

[13] MDCalc. Frequently Asked Questions. https://web.archive.org/web/20240405155011/https://www.mdcalc.com/faq. 2024

[14] Khandekar N, Jin Q, Xiong G, et al. MedCalc-Bench: Evaluating Large Language Models for Medical Calculations. In: Advances in Neural Information Processing Systems. Vol. 37. (NeurIPS 2024 Datasets and Benchmark Track Oral). 2024:84730–45. doi:10.52202/079017-2690. url: https://proceedings.neurips.cc/paper_files/paper/2024/file/99e81750f3fdfcaf9613d...

[15] Bedi S, Cui H, Fuentes M, et al. MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks. arXiv preprint 2025. doi:10.48550/arXiv.2505.23802

[16] Liu F, Wu J, Zhou H, et al. RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Prediction. medRxiv preprint 2025. doi:10.1101/2025.04.03.25323489

[17] Lin J, Wu Z, and Sun J. Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning. arXiv preprint 2025. doi:10.48550/arXiv.2505.24105

[18] Zhang X, Wang Y, Feng Z, et al. Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning. arXiv preprint 2025. doi:10.48550/arXiv.2506.12307

[19] Wang B, Xia I, Zhang Y, et al. From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, 2025:10820–44. doi:10.18653/v1/2025.emnlp-main.548

[20] Oh G, Kim S, Park S, and Kim BH. Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs. arXiv preprint 2025. doi:10.48550/arXiv.2506.13102

[21] Anthropic. Advancing Claude in healthcare and the life sciences. Anthropic (News / Announcements). Archived by the Internet Archive Wayback Machine. See Figure 1 of their press release. 2026. url: https://web.archive.org/web/20260112115216/https://www.anthropic.com/news/healthcare-life-sciences

[22] Anthropic. The Briefing: Healthcare and Life Sciences. YouTube video (livestream / on-demand recording). At 23:09-24:16, the presenter discusses the Claude model series’ performance on MedCalc-Bench and MedAgentBench. 2026. url: https://www.youtube.com/watch?v=UXyVMGAFLAs (visited on 01/17/2026)

[23] Goodell AJ, Chu SN, Rouholiman D, and Chu LF. Large language model agents can use tools to perform clinical calculations. npj Digit Med 2025;8:163. doi:10.1038/s41746-025-01475-8

[24] Roeschl T, Hoffmann M, Unbehaun A, et al. Development of an LLM Pipeline Surpassing Physicians in Cardiovascular Risk Score Calculation. medRxiv 2025:2025–11. doi:10.1101/2025.11.11.25340002

[25] Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems. Vol. 35. 2022:27730–44

[26] Ahmadian A, Cremer C, Gallé M, et al. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024:12248–67

[27] Google Cloud. Experiment with parameter values | Generative AI on Vertex AI. Accessed: 2025-12-30. 2025. url: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/adjust-parameter-values

[28] Armstrong JS. Long-Range Forecasting: From Crystal Ball to Computer. 2nd. New York: John Wiley & Sons, 1985. url: https://ssrn.com/abstract=666990

[29] Qwen. Qwen3 Technical Report. 2025. arXiv:2505.09388 [cs.CL]. url: https://web.archive.org/web/20250829140319/https://huggingface.co/Qwen/Qwen3-8B

[30] Shao Z, Wang P, Zhu Q, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300 2024

[31] CDC. Calculating Total Daily Dose of Opioids for Safer Dosage. U.S. Department of Health and Human Services. Archived from the original on November 25, 2022. 2016. url: https://web.archive.org/web/20221125005318/https://www.cdc.gov/drugoverdose/pdf/calculating_total_daily_dose-a.pdf

[32] Dowell D, Ragan KR, Jones CM, Baldwin GT, and Chou R. CDC Clinical Practice Guideline for Prescribing Opioids for Pain - United States, 2022. 2022. doi:10.15585/mmwr.rr7103a1. url: https://www.cdc.gov/mmwr/volumes/71/rr/rr7103a1.htm

[33] Maryland Department of Health. PDMP Morphine Milligram Equivalents Fact Sheet. Prescription Drug Monitoring Program. Archived from the original on August 31, 2025. 2025. url: https://web.archive.org/web/20250831211406/https://health.maryland.gov/pdmp/Documents/Clinical%20Docs/MME%20Fact%20Sheet.pdf

[34] Utah Department of Health and Human Services. Opioid Oral Morphine Milligram Equivalent (MME) Conversion Factors. Utah Medicaid. Archived from the original on August 1, 2025. 2025. url: https://web.archive.org/web/20250801184557/https://medicaid-documents.dhhs.utah.gov/Documents/files/Opioid-Morphine-EQ-Conversion-Factors.pdf

[35] Gerstgrasser M, Schaeffer R, Dey A, et al. Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. In: First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=5B2K4LRgmz

[36] Shumailov I, Shumaylov Z, Zhao Y, Papernot N, Anderson R, and Gal Y. AI models collapse when trained on recursively generated data. Nature 2024;631:755–9. doi:10.1038/s41586-024-07566-y

[37] Xie Y, Wu J, Tu H, et al. A preliminary study of o1 in medicine: Are we closer to an AI doctor? arXiv preprint arXiv:2409.15277 2024

[38] Zhou L, Pacchiardi L, Martínez-Plumed F, et al. General scales unlock AI evaluation with explanatory and predictive power. arXiv preprint arXiv:2503.06378 2025

[39] Xu X and Sankar R. Large Language Model Agents for Biomedicine: A Comprehensive Review of Methods, Evaluations, Challenges, and Future Directions. Information 2025;16:894

[40] Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 1992;8:229–56

[41] Sutton RS, McAllester D, Singh S, and Mansour Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 1999;12

[42] Schulman J, Moritz P, Levine S, Jordan M, and Abbeel P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. In: International Conference on Learning Representations (ICLR). 2016. url: https://arxiv.org/abs/1506.02438

[43] Kool W, Hoof H van, and Welling M. Buy 4 REINFORCE Samples, Get a Baseline for Free! ICLR 2019 Workshop drlStructPred. 2019. url: https://openreview.net/forum?id=r1lgTGL5DE

[44] Guo D, Yang D, Zhang H, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948 2025

[45] Kumar A, Zhuang V, Agarwal R, et al. Training Language Models to Self-Correct via Reinforcement Learning. In: The Thirteenth International Conference on Learning Representations. 2025. url: https://openreview.net/forum?id=CjwERcAU7w

[46] Liu Z, Chen C, Li W, et al. Understanding R1-Zero-Like Training: A Critical Perspective. arXiv preprint 2025. doi:10.48550/arXiv.2503.20783

[47] Sheng G, Zhang C, Ye Z, et al. HybridFlow: A Flexible and Efficient RLHF Framework. In: Proceedings of the Twentieth European Conference on Computer Systems. 2025:1279–97. doi:10.1145/3689031.3696075

[48] Jiang Y, Black KC, Geng G, et al. MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. NEJM AI 2025;2:AIdbp2500144. doi:10.1056/AIdbp2500144

Supplementary Appendix

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Junze (Tony) Ye, BS1,*; Daniel Tawfik, MD, MS2; Alex J. Goodell, MD, MS3; Nikhil ...

unknown" for patient notes lacking necessary information. Note that

MedCalc-Bench-Verified

reexamined MedCalc-Bench by introducing an LLM-judged, stepwise evaluation pipeline to grade an LLM’s intermediate calculation steps at test time, which can be thought of as providing process-level rewards in addition to binary final-answer correctness. Although they recommended removing 10.3% of the original test instances based on a clinician review, they o...

real numbers, e.g. blood pressure, Likert scale response;

Table 2: Mapping Formalism to a Concrete Example: The LACE Score [10]

| Symbol | Definition (Mathematical Formal) | Example (LACE Score for Readmission) |
|---|---|---|
| C | Patient context (EHR notes and lab data) | “72M admitted to ED for CHF exacerbation. Hospital stay was 4 days. Hx of diabetes (CCI=2). Visited the ED twice in last 5 months... |

categorical values, ranging from True/False to symptom groups; we call this a set C

a fallback undefined token denoted by N/A, indicating unextractability of a feature. In sum, $x_i \in X_i := \mathbb{R} \cup C \cup \{\mathrm{N/A}\}$. Let $X \triangleq \prod_{i=1}^{m} X_i$ denote the feature space, a Cartesian product of individual per-feature domains. Lastly, for each $(C, q)$ let $y$ be any answer predicted by an algorithm, and $y^*$ be the (latent) clinically correct ground truth answer; $y^*$ is a scalar if the quest...
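The per-feature domain above (a real number, a categorical value, or the N/A token) can be illustrated with the LACE example from Table 2. The following is a minimal sketch, not the authors' pipeline: the names `NA`, `in_domain`, and `lace_score` are hypothetical, the point values are the LACE weights as I recall them from reference [10] and should be checked against that paper, and treating "admitted to ED" as an acute admission is my own reading of the example note.

```python
import math

NA = "N/A"  # assumed fallback token for a feature that cannot be extracted


def in_domain(x, categories=frozenset()):
    """Check that x lies in R ∪ C ∪ {N/A} for a single feature."""
    if isinstance(x, str) and x == NA:
        return True
    if isinstance(x, (bool, str)):
        # Categorical values (including True/False) must be listed in C.
        return x in categories
    return isinstance(x, (int, float)) and math.isfinite(x)


def lace_score(los_days, acute_admission, charlson_index, ed_visits_6mo):
    """Return the LACE score, or None when any feature is N/A (incomputable)."""
    features = (los_days, acute_admission, charlson_index, ed_visits_6mo)
    if any(isinstance(f, str) and f == NA for f in features):
        return None

    # L: length of stay in days.
    if los_days < 1:
        l_pts = 0
    elif los_days <= 3:
        l_pts = int(los_days)
    elif los_days <= 6:
        l_pts = 4
    elif los_days <= 13:
        l_pts = 5
    else:
        l_pts = 7

    a_pts = 3 if acute_admission else 0                    # A: acuity of admission
    c_pts = charlson_index if charlson_index <= 3 else 5   # C: Charlson comorbidity index
    e_pts = min(ed_visits_6mo, 4)                          # E: ED visits in prior 6 months
    return l_pts + a_pts + c_pts + e_pts


# Features read from the example note: 4-day stay, acute admission, CCI=2, 2 ED visits.
print(lace_score(4, True, 2, 2))   # 4 + 3 + 2 + 2 = 11
print(lace_score(NA, True, 2, 2))  # None: a missing feature makes the score incomputable
```

Returning `None` rather than a default number is a deliberate choice in this sketch: it keeps an unextractable feature visible instead of silently producing a plausible-looking score.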

Patient notes are real, deidentified patient cases they scraped from journal-published medical papers that are archived in the PubMed Central database. There are altogether 10,053 + 1,047 = 11,100 notes in the train and test sets. We note that each of these patient notes can be seen as a unique context C in the notation of §D.1

The medical score questions (q’s), e.g. “What is the patient’s Glasgow Coma Score?”, are sourced as the 55 unique calculators listed as “popular” on MDCalc.com, a web app widely used by U.S. physicians. One can think of each train or test instance in the MedCalc-Bench dataset as uniquely identified by a pair (C, q). Many instances may share the same score q...

Read the patient note carefully and extract all relevant clinical values

Identify the appropriate medical formula or calculation method needed

Use Python code to perform the calculation accurately. Show your work step by step

After completing your calculation, provide your final numerical answer

Your final answer MUST be enclosed in <answer></answer> tags. For example, if your calculated answer is 42.5, you would write: <answer>42.5</answer>

Now solve the problem:

Figure 9: API call prompt template, applied identically to both treatment groups in §G.1.
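The prompt template requires the final answer to appear inside <answer></answer> tags. A minimal sketch of how a grader might recover that value from a model response is shown here. The function names are hypothetical, and taking the last tagged answer when several appear is an assumption of this sketch rather than something the paper specifies.

```python
import re
from typing import Optional

# Non-greedy match so that several tagged answers in one response stay separate.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> pair, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None


def extract_numeric_answer(response: str) -> Optional[float]:
    """Parse the tagged answer as a float; return None when missing or non-numeric."""
    raw = extract_answer(response)
    if raw is None:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


print(extract_numeric_answer("Step 1 ... Step 2 ... <answer>42.5</answer>"))  # 42.5
print(extract_numeric_answer("I could not find the creatinine value."))       # None
```

A response with no tags, or with non-numeric content such as a date, yields `None`, so the caller can decide separately how to score malformed or non-scalar outputs.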