
arxiv: 2512.19691 · v3 · submitted 2025-12-22 · 💻 cs.AI · stat.AP

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Pith reviewed 2026-05-16 20:19 UTC · model grok-4.3

classification 💻 cs.AI stat.AP
keywords LLM benchmarks · clinical AI · label errors · physician validation · MedCalc-Bench · benchmark auditing · reinforcement learning · stewardship pipeline

The pith

At least 27% of labels in an LLM-assisted clinical benchmark are erroneous or incomputable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits MedCalc-Bench, a clinical benchmark for medical score computation whose labels were partly derived with LLM assistance. It develops a scalable physician-in-the-loop stewardship pipeline and determines that at least 27% of test labels are likely erroneous or incomputable. On a 50-instance physician-validated subset, recomputed labels agree with physician ground truth 74% of the time versus 20% for the originals. Using the original labels underestimates frontier LLM accuracy by 16-23 percentage points, while training on recomputed labels improves performance by 13.5 points on physician-labeled instances with gains that extend to related tasks. This demonstrates that LLM-assisted benchmarks require active oversight to avoid propagating systematic errors into evaluation and post-training.

Core claim

The central claim is that LLM-assisted reference labels in MedCalc-Bench contain substantial errors, with at least 27% of the test set likely erroneous or incomputable from the inputs. A physician-in-the-loop pipeline recomputes the labels and achieves 74% agreement with physician ground truth on a validated 50-instance subset, compared with 20% agreement for the original labels. Evaluation of frontier LLMs on the original labels underestimates accuracy by 16-23 percentage points. A controlled reinforcement-learning experiment shows that a model trained on the recomputed labels outperforms one trained on the originals by 13.5 percentage points on physician-labeled instances, and the benefit extends to related medical tasks.

What carries the argument

The scalable physician-in-the-loop stewardship pipeline that recomputes benchmark labels and validates them against physician judgments.

Load-bearing premise

The 50-instance physician-validated subset must be representative of the full test set, and physician labels must constitute reliable ground truth without further adjudication.

What would settle it

Re-validating a substantially larger random sample of the full test set with multiple physicians to check whether the 27% error rate and the 74% versus 20% agreement rates hold.

Figures

Figures reproduced from arXiv: 2512.19691 by Alex J. Goodell, Daniel Tawfik, Junze Ye, Mark K. Buyyounouski, Mohsen Bayati, Nikhil V. Kotha.

Figure 1: Our benchmark stewardship utilizes two distinct LLM agent workflows (Phases 1 & 2) to assure MedCalc-Bench’s label quality. Besides their prompt and output type (Yes/No verdict versus a recomputed label), the two workflows differ in what the LLM agent is given as context: the Phase 1 agent is shown the original reference label and its derivation metadata, whereas the Phase 2 agent is only provided the pati…
Figure 2: Representative error types. (a) Feature extraction error: GPT-4 might have confused “hemoglobin” with “albumin”, extracting a value that is physiologically impossible; (b) Incorrect aggregation logic: an incorrect Python code for Glasgow Coma Scale aggregation that double-counts a feature value, inflating $\hat{y}_{\text{original}}$; (c) q is not answerable given C: a Sodium correction for hyperglycemia inappropriately ap…
Figure 3: Label Instantiation Changes Alignment Interpretation. Test accuracy dynamics for Qwen3-8B trained via GRPO using the recomputed reference labels (green) versus the original MedCalc-Bench labels (grey). Shaded bands indicate ±1σ smoothed over a 10-step window. Improving the reward signal’s factual grounding effects a +8.7% absolute gain in the final moving averages of model test accuracy (71.4% vs. 62.6%).…
Figure 4: Prompts for the controlled RL experiment in §2.6, which are identically used across the two comparison groups. These prompts are fed as context into a Qwen3-8B model checkpoint that parametrizes the RL policy.
Figure 5: Prompts for the LLM audit agent executing the Phase 1 workflow (§2.3).
Figure 6: Prompts for the LLM relabeling agent executing the Phase 2 workflow (§2.4). Note that unlike…
Figure 7: Benchmark performance of four Claude-series models was reported by Anthropic in Exhibit #1 of their Jan 11, 2026 press release [21]. Screenshot taken on Jan 17, 2026. We configured our API call pipeline to match the setup described in the first two bullet-point footnotes. MedAgentBench [48] is unrelated to our study.
Figure 8: MedCalc-Bench accuracy comparison between Anthropic’s official results (shown in…
Figure 9: API call prompt template, applied identically to both treatment groups in §G.1.
Original abstract

Reference labels for machine-learning benchmarks are increasingly synthesized with LLM assistance, but their reliability remains underexamined. We audit MedCalc-Bench, a clinical benchmark for medical score computation whose labels were partly derived with LLM assistance, and develop a scalable physician-in-the-loop stewardship pipeline to reassess them. At least 27% of test labels are likely erroneous or incomputable. On a 50-instance subset validated by physicians, our recomputed labels agree with physician ground truth 74% of the time (95% CI, 60-84%) versus 20% for the originals (95% CI, 11-33%). Using original labels to evaluate frontier LLMs underestimates accuracy by 16-23 percentage points. In a controlled reinforcement-learning experiment, a model trained on recomputed labels outperforms one trained on originals by 13.5 percentage points (95% CI, 10.6-16.6%) on physician-labeled instances, and this advantage extends to related medical tasks. LLM-assisted benchmarks can propagate systematic errors into both evaluation and post-training unless actively stewarded.
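The abstract reports the agreement rates with 95% confidence intervals but, in the text reproduced here, does not name the interval method. As a hedged illustration only, the sketch below computes a Wilson score interval under the assumption that 74% and 20% of 50 correspond to counts of 37 and 10. Under that assumption the bounds round to 60-84% and 11-33%, which matches the reported figures; this is consistent with a Wilson interval but does not establish that the authors used one. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Two-sided Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumed counts: 74% and 20% of the 50 physician-validated instances.
for name, k in [("recomputed", 37), ("original", 10)]:
    lo, hi = wilson_interval(k, 50)
    print(f"{name}: {k}/50 = {k / 50:.0%}, 95% CI {lo:.0%}-{hi:.0%}")
```

With only 50 instances, both intervals span more than 20 percentage points, yet they do not overlap.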

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper audits MedCalc-Bench, a clinical benchmark whose labels were partly LLM-assisted, and introduces a physician-in-the-loop stewardship pipeline. It reports that at least 27% of test labels are erroneous or incomputable; on a 50-instance physician-validated subset, recomputed labels agree with physician ground truth 74% of the time (95% CI 60-84%) versus 20% (95% CI 11-33%) for the originals. This leads to 16-23pp underestimation of frontier LLM accuracy, and a controlled RL experiment shows models trained on recomputed labels outperform those trained on originals by 13.5pp (95% CI 10.6-16.6%) on physician-labeled instances, with the advantage extending to related tasks.

Significance. If the central findings hold, the work is significant for AI-assisted clinical benchmarks because it supplies direct, quantitative evidence (agreement rates with CIs and a controlled RL delta) that unverified LLM labels can propagate systematic errors into both evaluation and post-training. The physician-in-the-loop pipeline is presented as scalable and reproducible, offering a concrete template for stewardship that could raise standards in medical ML benchmarks.

major comments (3)
  1. [Physician-validated subset (results section describing the 50-instance audit)] The selection process for the 50-instance physician-validated subset is unspecified (random sampling? stratification? criteria favoring recomputable cases?). This is load-bearing for the headline claim of at least 27% erroneous labels on the full test set, because the extrapolation and the reported 74% vs 20% gap both rest on the subset being representative.
  2. [Physician validation procedure (methods and results on ground-truth construction)] Physician labels are used as ground truth for the agreement rates and RL evaluation without any inter-rater reliability metric (e.g., pairwise agreement or kappa). Single-rater variability could inflate or deflate the 74% vs 20% difference and the 13.5pp RL advantage, both of which are central to the stewardship argument.
  3. [Results on label error rate (the paragraph reporting the 27% statistic)] The 27% figure on the full test set is stated as 'likely erroneous or incomputable' but the exact derivation (direct count, model-based inference, or subset extrapolation) is not detailed enough to assess whether it inherits the same selection bias as the 50-instance sample.
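Major comment 2 asks for an inter-rater reliability metric such as kappa. A minimal sketch of Cohen's kappa for two raters follows, assuming each rater issues a categorical verdict per instance (for example, whether a candidate label is correct). The function name and the verdicts in the usage lines are invented for illustration and are not study data.

```python
from collections import Counter


def cohen_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters coincide.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters were independent with these marginals.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts from two hypothetical physicians on six instances.
physician_1 = ["ok", "ok", "err", "err", "ok", "err"]
physician_2 = ["ok", "err", "err", "err", "ok", "ok"]
kappa = cohen_kappa(physician_1, physician_2)
print(f"kappa = {kappa:.3f}")
```

Here the raters agree on 4 of 6 items while chance agreement is 0.5, so kappa is 1/3.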
minor comments (2)
  1. [Abstract and results tables] The 95% CIs are reported but the exact statistical method (binomial, bootstrap, etc.) and software used should be stated explicitly for reproducibility.
  2. [RL experiment subsection] The RL experiment description would benefit from a brief diagram or pseudocode showing how the recomputed vs original label sets were used for training and how the physician-labeled test instances were held out.
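Minor comment 2 asks for pseudocode of the controlled comparison. The sketch below is one hypothetical way to express such a design, not the authors' implementation: both arms see identical prompts and differ only in the reference label, and any physician-labeled instance is kept out of training. The `Instance` fields, the `build_arms` helper, and the pooled layout are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: int
    prompt: str
    original_label: float
    recomputed_label: float


def build_arms(instances, physician_ids):
    """Return (arm_original, arm_recomputed, held_out_ids).

    The two arms share prompts and ordering; only the label column differs.
    Instances with physician labels are excluded from both arms.
    """
    train = [x for x in instances if x.instance_id not in physician_ids]
    arm_original = [(x.prompt, x.original_label) for x in train]
    arm_recomputed = [(x.prompt, x.recomputed_label) for x in train]
    held_out = sorted(
        x.instance_id for x in instances if x.instance_id in physician_ids
    )
    return arm_original, arm_recomputed, held_out


# Toy data, invented for illustration.
pool = [
    Instance(1, "note A + question", 1.0, 2.0),
    Instance(2, "note B + question", 3.0, 3.0),
    Instance(3, "note C + question", 5.0, 6.0),
]
arm_orig, arm_recomp, held_out_ids = build_arms(pool, physician_ids={2})
```

Holding the prompts fixed across arms is what lets any accuracy gap on the held-out instances be attributed to the labels alone.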

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us strengthen the manuscript. We address each major point below with clarifications and revisions where needed. The physician-in-the-loop pipeline remains a core contribution, and we have improved transparency around sampling, validation, and error-rate derivation without altering the central findings.

Point-by-point responses
  1. Referee: [Physician-validated subset (results section describing the 50-instance audit)] The selection process for the 50-instance physician-validated subset is unspecified (random sampling? stratification? criteria favoring recomputable cases?). This is load-bearing for the headline claim of at least 27% erroneous labels on the full test set, because the extrapolation and the reported 74% vs 20% gap both rest on the subset being representative.

    Authors: The 50-instance subset was drawn via simple random sampling from the full test set using a fixed seed for reproducibility; no stratification or selection favoring recomputable cases was applied. We have revised the Methods section to state this explicitly, including the sampling procedure and seed, so that readers can directly assess representativeness. This clarification supports the reported agreement rates and the extrapolation to the 27% figure on the full set. revision: yes

  2. Referee: [Physician validation procedure (methods and results on ground-truth construction)] Physician labels are used as ground truth for the agreement rates and RL evaluation without any inter-rater reliability metric (e.g., pairwise agreement or kappa). Single-rater variability could inflate or deflate the 74% vs 20% difference and the 13.5pp RL advantage, both of which are central to the stewardship argument.

    Authors: We acknowledge that the use of a single board-certified physician as ground truth is a limitation, as inter-rater reliability was not measured. The physician followed a standardized protocol based on published clinical guidelines and was blinded to original labels. We have added an explicit limitations paragraph discussing single-rater variability and its potential impact on the observed gaps, while noting that the magnitude of the 74% vs 20% difference makes systematic bias unlikely to reverse the direction of the findings. Future extensions could include multiple raters. revision: partial

  3. Referee: [Results on label error rate (the paragraph reporting the 27% statistic)] The 27% figure on the full test set is stated as 'likely erroneous or incomputable' but the exact derivation (direct count, model-based inference, or subset extrapolation) is not detailed enough to assess whether it inherits the same selection bias as the 50-instance sample.

    Authors: The 27% figure is obtained by direct enumeration over the entire test set: we flagged every label that was either incomputable (missing required inputs) or violated the original MedCalc computation rules. It is not an extrapolation from the 50-instance subset and therefore does not inherit its sampling properties. We have expanded the Results section with the precise counting procedure, a breakdown by error type, and a statement confirming the full-set scope. revision: yes
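The first response describes simple random sampling with a fixed seed. A minimal sketch of such a draw follows. The seed value is hypothetical because the rebuttal does not state it, and the test-set size of 1,047 is taken from the note count quoted in the extracted appendix text, on the assumption that it equals the number of test instances.

```python
import random

TEST_SET_SIZE = 1047  # assumed: note count quoted in the appendix fragment
SUBSET_SIZE = 50
SEED = 0  # hypothetical: the actual seed is not given in this text


def draw_audit_subset(n_total=TEST_SET_SIZE, k=SUBSET_SIZE, seed=SEED):
    """Draw k distinct indices uniformly at random, reproducibly."""
    rng = random.Random(seed)  # local generator, leaves global state alone
    return sorted(rng.sample(range(n_total), k))


subset = draw_audit_subset()
print(len(subset), subset[:5])
```

Using a dedicated `random.Random` instance rather than the module-level generator keeps the draw reproducible even if other code consumes random numbers first.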

Circularity Check

0 steps flagged

No significant circularity; claims rest on external physician validation

Full rationale

The paper's central results (≥27% erroneous labels, 74% vs 20% agreement, 13.5pp RL gain) are obtained by direct recomputation and comparison against independent physician-provided ground truth on a 50-instance subset. These are empirical measurements against an external reference rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations or ansatzes are invoked that collapse the output to the input by construction. The derivation chain is therefore self-contained against the physician-labeled instances.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that physician labels serve as ground truth and that the 50-instance sample is representative; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Physician-provided labels constitute reliable ground truth for the audited instances
    Invoked when reporting agreement rates between recomputed labels and physician judgments

pith-pipeline@v0.9.0 · 5514 in / 1191 out tokens · 38225 ms · 2026-05-16T20:19:33.048964+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 5 internal anchors

  1. [1]

    Estimating the attributable cost of physician burnout in the United States

    Han S, Shanafelt TD, Sinsky CA, et al. Estimating the attributable cost of physician burnout in the United States. Ann Intern Med 2019;170:784–90. doi:10.7326/M18-1422

  2. [2]

    Health Care Expenditures Attributable to Primary Care Physician Overall and Burnout-Related Turnover: A Cross-sectional Analysis

    Sinsky CA, Shanafelt TD, Dyrbye LN, Sabety AH, Carlasare LE, and West CP. Health Care Expenditures Attributable to Primary Care Physician Overall and Burnout-Related Turnover: A Cross-sectional Analysis. Mayo Clin Proc 2022;97:693–702. doi:10.1016/j.mayocp.2021.09.013

  3. [3]

    Evaluating the Costs of Nurse Burnout-Attributed Turnover: A Markov Modeling Approach

    Muir KJ, Wanchek TN, Lobo JM, and Keim-Malpass J. Evaluating the Costs of Nurse Burnout-Attributed Turnover: A Markov Modeling Approach. J Patient Saf 2022;18:351–7. doi:10.1097/PTS.0000000000000920

  4. [4]

    Predicting Primary Care Physician Burnout From Electronic Health Record Use Measures

    Tawfik D, Bayati M, Liu J, et al. Predicting Primary Care Physician Burnout From Electronic Health Record Use Measures. Mayo Clin Proc 2024;99:1411–21. doi:10.1016/j.mayocp.2024.01.005

  5. [5]

    Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation

    Tierney AA, Gayre G, Hoberman B, et al. Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation. NEJM Catal Innov Care Deliv 2024;5:0404. doi:10.1056/CAT.23.0404

  6. [6]

    A pragmatic randomized controlled trial of ambient artificial intelligence to improve health practitioner well-being

    Afshar M, Ryan Baumann M, Resnik F, et al. A pragmatic randomized controlled trial of ambient artificial intelligence to improve health practitioner well-being. NEJM AI 2025;2:AIoa2500945. doi:10.1056/AIoa2500945

  7. [7]

    Large language models for preventing medication direction errors in online pharmacies

    Pais C, Liu J, Voigt R, Gupta V, Wade E, and Bayati M. Large language models for preventing medication direction errors in online pharmacies. Nat Med 2024;30:1574–82. doi:10.1038/s41591-024-02933-8

  8. [8]

    AI-based Clinical Decision Support for Primary Care: A Real-World Study

    Korom R, Kiptinness S, Adan N, et al. AI-based Clinical Decision Support for Primary Care: A Real-World Study. arXiv preprint 2025. doi:10.48550/arXiv.2507.16947

  9. [9]

    Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records

    Chung P, Swaminathan A, Goodell AJ, et al. Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records. NEJM AI 2025;3:AIdbp2500418. doi:10.1056/AIdbp2500418

  10. [10]

    Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community

    Van Walraven C, Dhalla IA, Bell C, et al. Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. Can Med Assoc J 2010;182:551–7. doi:10.1503/cmaj.091117

  11. [11]

    Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study

    Lim W, Eerden M van der, Laing R, et al. Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study. Thorax 2003;58:377–82. doi:10.1136/thorax.58.5.377

  12. [12]

    Lip GY, Nieuwlaat R, Pisters R, Lane DA, and Crijns HJ. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the Euro Heart Survey on atrial fibrillation. Chest 2010;137:263–72. doi:10.1378/chest.09-1584

  13. [13]

    MDCalc. Frequently Asked Questions. https://web.archive.org/web/20240405155011/https://www.mdcalc.com/faq. 2024

  14. [14]

    MedCalc-Bench: Evaluating Large Language Models for Medical Calculations

    Khandekar N, Jin Q, Xiong G, et al. MedCalc-Bench: Evaluating Large Language Models for Medical Calculations. In: Advances in Neural Information Processing Systems. Vol. 37. (NeurIPS 2024 Datasets and Benchmark Track Oral). 2024:84730–45. doi:10.52202/079017-2690. url:https://proceedings.neurips.cc/paper_files/paper/2024/file/99e81750f3fdfcaf9613d...

  15. [15]

    MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

    Bedi S, Cui H, Fuentes M, et al. MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks. arXiv preprint 2025. doi:10.48550/arXiv.2505.23802

  16. [16]

    RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Prediction

    Liu F, Wu J, Zhou H, et al. RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Prediction. medRxiv preprint 2025. doi:10.1101/2025.04.03.25323489

  17. [17]

    Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning

    Lin J, Wu Z, and Sun J. Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning. arXiv preprint 2025. doi:10.48550/arXiv.2505.24105

  18. [18]

    Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning

    Zhang X, Wang Y, Feng Z, et al. Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning. arXiv preprint 2025. doi:10.48550/arXiv.2506.12307

  19. [19]

    From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations

    Wang B, Xia I, Zhang Y, et al. From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, 2025:10820–44. doi:10.18653/v1/2025.emnlp-main.548

  20. [20]

    Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs

    Oh G, Kim S, Park S, and Kim BH. Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs. arXiv preprint 2025. doi:10.48550/arXiv.2506.13102

  21. [21]

    Advancing Claude in healthcare and the life sciences

    Anthropic. Advancing Claude in healthcare and the life sciences. Anthropic (News / Announcements). Archived by the Internet Archive Wayback Machine. See Figure 1 of their press release. 2026. url:https://web.archive.org/web/20260112115216/https://www.anthropic.com/news/healthcare-life-sciences

  22. [22]

    The Briefing: Healthcare and Life Sciences

    Anthropic. The Briefing: Healthcare and Life Sciences. YouTube video (livestream / on-demand recording). At 23:09-24:16, the presenter discusses Claude model series’ performance on MedCalc-Bench and MedAgentBench. 2026. url:https://www.youtube.com/watch?v=UXyVMGAFLAs (visited on 01/17/2026)

  23. [23]

    Large language model agents can use tools to perform clinical calculations

    Goodell AJ, Chu SN, Rouholiman D, and Chu LF. Large language model agents can use tools to perform clinical calculations. npj Digit Med 2025;8:163. doi:10.1038/s41746-025-01475-8

  24. [24]

    Development of an LLM Pipeline Surpassing Physicians in Cardiovascular Risk Score Calculation

    Roeschl T, Hoffmann M, Unbehaun A, et al. Development of an LLM Pipeline Surpassing Physicians in Cardiovascular Risk Score Calculation. medRxiv 2025:2025–11. doi:10.1101/2025.11.11.25340002

  25. [25]

    Training language models to follow instructions with human feedback

    Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems. Vol. 35. 2022:27730–44

  26. [26]

    Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs

    Ahmadian A, Cremer C, Gallé M, et al. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024:12248–67

  27. [27]

    Experiment with parameter values | Generative AI on Vertex AI

    Google Cloud. Experiment with parameter values | Generative AI on Vertex AI. Accessed: 2025-12-30. 2025. url:https://docs.cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/adjust-parameter-values

  28. [28]

    Long-Range Forecasting: From Crystal Ball to Computer

    Armstrong JS. Long-Range Forecasting: From Crystal Ball to Computer. 2nd. New York: John Wiley & Sons, 1985. url:https://ssrn.com/abstract=666990

  29. [29]

    Qwen3 Technical Report

    Qwen. Qwen3 Technical Report. 2025. arXiv:2505.09388 [cs.CL]. url:https://web.archive.org/web/20250829140319/https://huggingface.co/Qwen/Qwen3-8B

  30. [30]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao Z, Wang P, Zhu Q, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 2024

  31. [31]

    Calculating Total Daily Dose of Opioids for Safer Dosage

    CDC. Calculating Total Daily Dose of Opioids for Safer Dosage. U.S. Department of Health and Human Services. Archived from the original on November 25, 2022. 2016. url:https://web.archive.org/web/20221125005318/https://www.cdc.gov/drugoverdose/pdf/calculating_total_daily_dose-a.pdf

  32. [32]

    CDC Clinical Practice Guideline for Prescribing Opioids for Pain — United States, 2022

    Dowell D, Ragan KR, Jones CM, Baldwin GT, and Chou R. CDC Clinical Practice Guideline for Prescribing Opioids for Pain — United States, 2022. 2022. doi:10.15585/mmwr.rr7103a1. url:https://www.cdc.gov/mmwr/volumes/71/rr/rr7103a1.htm

  33. [33]

    PDMP Morphine Milligram Equivalents Fact Sheet

    Maryland Department of Health. PDMP Morphine Milligram Equivalents Fact Sheet. Prescription Drug Monitoring Program. Archived from the original on August 31, 2025. 2025. url:https://web.archive.org/web/20250831211406/https://health.maryland.gov/pdmp/Documents/Clinical%20Docs/MME%20Fact%20Sheet.pdf

  34. [34]

    Opioid Oral Morphine Milligram Equivalent (MME) Conversion Factors

    Utah Department of Health and Human Services. Opioid Oral Morphine Milligram Equivalent (MME) Conversion Factors. Utah Medicaid. Archived from the original on August 1, 2025. 2025. url:https://web.archive.org/web/20250801184557/https://medicaid-documents.dhhs.utah.gov/Documents/files/Opioid-Morphine-EQ-Conversion-Factors.pdf

  35. [35]

    Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

    Gerstgrasser M, Schaeffer R, Dey A, et al. Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. In: First Conference on Language Modeling. 2024. url:https://openreview.net/forum?id=5B2K4LRgmz

  36. [36]

    AI models collapse when trained on recursively generated data

    Shumailov I, Shumaylov Z, Zhao Y, Papernot N, Anderson R, and Gal Y. AI models collapse when trained on recursively generated data. Nature 2024;631:755–9. doi:10.1038/s41586-024-07566-y

  37. [37]

    A preliminary study of o1 in medicine: Are we closer to an ai doctor?

    Xie Y, Wu J, Tu H, et al. A preliminary study of o1 in medicine: Are we closer to an ai doctor? arXiv preprint arXiv:2409.15277 2024

  38. [38]

    General scales unlock ai evaluation with explanatory and predictive power

    Zhou L, Pacchiardi L, Martínez-Plumed F, et al. General scales unlock ai evaluation with explanatory and predictive power. arXiv preprint arXiv:2503.06378 2025

  39. [39]

    Large Language Model Agents for Biomedicine: A Comprehensive Review of Methods, Evaluations, Challenges, and Future Directions

    Xu X and Sankar R. Large Language Model Agents for Biomedicine: A Comprehensive Review of Methods, Evaluations, Challenges, and Future Directions. Information 2025;16:894

  40. [40]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 1992;8:229–56

  41. [41]

    Policy gradient methods for reinforcement learning with function approximation

    Sutton RS, McAllester D, Singh S, and Mansour Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 1999;12

  42. [42]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    Schulman J, Moritz P, Levine S, Jordan M, and Abbeel P. High-dimensional continuous control using generalized advantage estimation. In: International Conference on Learning Representations (ICLR). 2016. url:https://arxiv.org/abs/1506.02438

  43. [43]

    Buy 4 REINFORCE Samples, Get a Baseline for Free!

    Kool W, Hoof H van, and Welling M. Buy 4 REINFORCE Samples, Get a Baseline for Free! ICLR 2019 Workshop drlStructPred. 2019. url:https://openreview.net/forum?id=r1lgTGL5DE

  44. [44]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo D, Yang D, Zhang H, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 2025

  45. [45]

    Training Language Models to Self-Correct via Reinforcement Learning

    Kumar A, Zhuang V, Agarwal R, et al. Training Language Models to Self-Correct via Reinforcement Learning. In: The Thirteenth International Conference on Learning Representations. 2025. url:https://openreview.net/forum?id=CjwERcAU7w

  46. [46]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Liu Z, Chen C, Li W, et al. Understanding R1-Zero-Like Training: A Critical Perspective. arXiv preprint 2025. doi:10.48550/arXiv.2503.20783

  47. [47]

    Hybridflow: A flexible and efficient rlhf framework

    Sheng G, Zhang C, Ye Z, et al. Hybridflow: A Flexible and Efficient RLHF Framework. In: Proceedings of the Twentieth European Conference on Computer Systems. 2025:1279–97. doi:10.1145/3689031.3696075

  48. [48]

    MedAgentBench: a virtual EHR environment to benchmark medical LLM agents

    Jiang Y, Black KC, Geng G, et al. MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. NEJM AI 2025;2:AIdbp2500144. doi:10.1056/AIdbp2500144

    Supplementary Appendix. Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight. Junze (Tony) Ye, BS1,*; Daniel Tawfik, MD, MS2; Alex J. Goodell, MD, MS3; Nikhil ...

  49. [49]

    MedCalc-Bench-Verified

    reexamined MedCalc-Bench by introducing an LLM-judged, stepwise evaluation pipeline to grade an LLM’s intermediate calculation steps at test time, which can be thought of as providing process-level rewards besides a binary final answer correctness. Although they recommended to remove 10.3% of the original test instances based on a clinician review, they o...

  50. [50]

    72M admitted to ED for CHF exacerbation. Hospital stay was 4 days. Hx of diabetes (CCI=2). Visited the ED twice in last 5 months

    real numbers, e.g. blood pressure, Likert scale response;
    Table 2: Mapping Formalism to a Concrete Example: The LACE Score [10]. Column headers: Mathematical Formal; LACE Score for Readmission; Symbol; Definition; Example. First row: C; Patient context (EHR notes and lab data); “72M admitted to ED for CHF exacerbation. Hospital stay was 4 days. Hx of diabetes (CCI=2). Visited the ED twice...

  51. [51]

    categorical values, ranging from True/False to symptom groups; we call this a set C

  52. [52]

    In sum, $x_i \in X_i := \mathbb{R} \cup C \cup \{\text{N/A}\}$

    a fallback undefined token denoted by N/A, indicating unextractability of a feature. In sum, $x_i \in X_i := \mathbb{R} \cup C \cup \{\text{N/A}\}$. Let $X \triangleq \prod_{i=1}^{m} X_i$ denote the feature space, a Cartesian product of individual per-feature domains. Lastly, for each $(C, q)$ let $y$ be any answer predicted by an algorithm, and $y^*$ be the (latent) clinically correct ground truth answer; $y^*$ is a scalar if the quest...

  53. [53]

    There are altogether 10,053 + 1,047 = 11,100 notes in the train and test sets

    Patient notes are real, deidentified patient cases they scraped from journal-published medical papers that are archived in the PubMed Central database. There are altogether 10,053 + 1,047 = 11,100 notes in the train and test sets. We note that each of these patient notes can be seen as a unique context C in the notation of §D.1

  54. [54]

    What is the patient’s Glasgow Coma Score?

    The medical score questions (q’s), e.g. “What is the patient’s Glasgow Coma Score?”, are sourced as the 55 unique calculators listed as “popular” on MDCalc.com, a web app widely used by U.S. physicians. One can think of each train or test instance in the MedCalc-Bench dataset as uniquely identified by a pair (C, q). Many instances may share the same score q...

  55. [55]

    Read the patient note carefully and extract all relevant clinical values

  56. [56]

    Identify the appropriate medical formula or calculation method needed

  57. [57]

    Show your work step by step

    Use Python code to perform the calculation accurately. Show your work step by step

  58. [58]

    After completing your calculation, provide your final numerical answer

  59. [59]

    Your final answer MUST be enclosed in <answer></answer> tags. For example, if your calculated answer is 42.5, you would write: <answer>42.5</answer> Now solve the problem:
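The prompt fragments above instruct the model to wrap its final number in <answer></answer> tags. A minimal sketch of how a completion in that format could be parsed and scored against a reference label follows. The helper names, the choice to read the last tag pair, and the 5% relative tolerance are illustrative assumptions, not the scoring rule used in the paper.

```python
import math
import re
from typing import Optional

# Non-greedy match so multiple tag pairs are captured separately.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def extract_answer(completion: str) -> Optional[float]:
    """Return the number inside the last <answer> tag pair, or None."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except ValueError:
        return None  # tag present but content is not numeric


def is_correct(completion: str, reference: float, rel_tol: float = 0.05) -> bool:
    """Binary correctness under an assumed relative tolerance (hypothetical)."""
    value = extract_answer(completion)
    return value is not None and math.isclose(value, reference, rel_tol=rel_tol)


sample = "BMI = 70 / 1.283**2 ... <answer>42.5</answer>"
print(extract_answer(sample), is_correct(sample, 42.5))
```

Because the verdict depends entirely on `reference`, the same completion can flip from wrong to right when an erroneous label is replaced, which is the effect the audit measures.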