HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine
Pith reviewed 2026-05-22 01:08 UTC · model grok-4.3
The pith
HealthCraft reveals frontier language models failing most multi-step emergency medicine safety tasks in a new simulated environment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HealthCraft supplies 195 tasks across six categories graded against 2,255 binary criteria, 515 of them safety-critical, inside a deterministic FHIR R4 simulator. On this benchmark two frontier models record Pass@1 scores of 24.8 percent and 12.6 percent together with safety-failure rates of 27.5 percent and 34.0 percent, collapsing to 1.0 percent and 0.0 percent on multi-step workflows.
What carries the argument
The dual-layer rubric that assigns zero reward for any violation among the 515 safety-critical criteria inside the FHIR R4 environment with 24 exposed tools.
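The gating behaviour described above can be sketched in a few lines. This is a minimal illustration, not the released harness: the `CriterionResult` type, its field names, and the choice to score non-violating trajectories as the fraction of criteria passed are all assumptions made here. The source confirms only that criteria are binary and that any safety-critical violation zeroes the reward.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CriterionResult:
    """One graded binary criterion (hypothetical structure, not the paper's schema)."""
    criterion_id: str
    passed: bool
    safety_critical: bool


def dual_layer_reward(results: Sequence[CriterionResult]) -> float:
    """Return 0.0 on any safety-critical violation, else the pass fraction."""
    if not results:
        raise ValueError("at least one graded criterion is required")

    # Layer 1: safety gate. A single failed safety-critical criterion
    # zeroes the reward for the whole trajectory.
    if any(r.safety_critical and not r.passed for r in results):
        return 0.0

    # Layer 2 (assumed here): fraction of all criteria that passed.
    return sum(r.passed for r in results) / len(results)
```

Under this sketch, a trajectory that satisfies every ordinary criterion but violates one safety-critical criterion scores the same as one that satisfies nothing, which is the property that makes the gate strict.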
If this is right
- Infrastructure fidelity directly affects measured model rankings, as six bugs fixed between pilot versions reordered which model appeared stronger.
- The reward signal cannot be used as-is for training: restraint criteria pass at 0.929 prevalence in the negative-class pilot, which leaves the reward open to gaming.
- Partial competence on single steps does not produce integrated multi-step success, which points to a specific failure mode under sustained clinical pressure.
- The environment includes scaffolding for coupling to training loops such as Megatron plus SGLang plus GRPO.
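One simple way to see why partial step-level competence need not translate into workflow success is to treat a workflow as a chain of steps that must all succeed. The sketch below assumes independent steps, which is an illustrative simplification introduced here and not a model the paper proposes; the function name is hypothetical, and the paper's observed collapse may have other causes.

```python
import math
from typing import Sequence


def workflow_success_probability(step_success: Sequence[float]) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if any(not 0.0 <= p <= 1.0 for p in step_success):
        raise ValueError("each step probability must lie in [0, 1]")
    # With independence, the joint success probability is the product
    # of the per-step success probabilities.
    return math.prod(step_success)
```

For example, four steps that each succeed half the time give `workflow_success_probability([0.5] * 4)`, a joint probability of 0.0625, so moderate per-step competence can still yield rare end-to-end success.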
Where Pith is reading between the lines
- The environment could serve as a testbed for targeted safety fine-tuning before any clinical deployment attempt.
- Similar trajectory-level safety rubrics may be required in other high-stakes sequential domains such as critical infrastructure control.
- The observed gap between step-wise and workflow performance points to a need for training methods that explicitly optimize long-horizon constraint satisfaction.
- Public release under Apache 2.0 enables independent replication and extension by groups focused on medical AI alignment.
Load-bearing premise
The dual-layer rubric and its 2,255 binary criteria capture every safety violation that would matter in actual emergency medicine.
What would settle it
A direct comparison of model trajectories that pass HealthCraft against the same decisions reviewed by emergency physicians in matched real or high-fidelity cases would show whether the simulated safety signal predicts actual clinical risk.
Original abstract
Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5-28.4] and GPT-5.4 at 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%. On multi-step workflows - the closest proxy to real emergency care - performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM-judge overlay bounds evaluator noise, and a 60-run negative-class smoke pilot shows the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training-reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
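The abstract reports Pass@1 with bracketed intervals but, as quoted here, does not say how they were computed. As a hedged illustration only, the Wilson score interval is one common choice for a binomial proportion; the function below is a sketch under that assumption, and the trial counts a reader passes in are their own, since the abstract does not state them.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")

    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    # Centre of the interval is pulled toward 0.5 relative to p_hat.
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

Whether the paper used this method, a bootstrap, or something else would need to be checked against the full text before comparing intervals.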
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HealthCraft, the first public reinforcement-learning safety environment for emergency medicine. Adapted from Corecraft and built on a FHIR R4 simulator with 14 entity types, 3,987 seed entities, and 24 MCP tools, it defines a dual-layer rubric of 2,255 binary criteria (515 safety-critical) that zeros reward on any violation. The authors release 195 tasks (plus a 10-task negative-class extension) across six categories and report V8 empirical results on two frontier models: Claude Opus 4.6 achieves Pass@1 of 24.8% [21.5-28.4] with 27.5% safety failures, while GPT-5.4 reaches 12.6% [10.2-15.6] with 34.0% safety failures; multi-step workflow performance collapses to 1.0% and 0.0% respectively. The paper also documents six infrastructure bugs that reordered model rankings between pilots, provides a deterministic LLM-judge overlay, and scaffolds future coupling to a Megatron+SGLang+GRPO training loop while releasing all assets under Apache 2.0.
Significance. If the rubric proves clinically valid, HealthCraft would supply a valuable, reproducible benchmark for trajectory-level safety failures that static medical QA datasets miss, directly supporting safer deployment of LLMs in high-stakes emergency care. The explicit release of the full environment, tasks, rubrics, and harness, together with confidence intervals and acknowledgment of infrastructure effects on rankings, constitutes a concrete community resource that strengthens the work's utility beyond the reported numbers.
major comments (2)
- [Rubric construction (abstract and §3–4)] The 2,255 binary criteria (515 safety-critical) are derived internally from the 14 entity types and 24 MCP tools with no reported clinician review, inter-rater reliability, or correlation against real-world adverse events. Because the headline safety-failure rates (27.5%, 34.0%) and the multi-step collapse claim rest entirely on these criteria zeroing reward, the absence of external clinical grounding is load-bearing for interpreting the results as evidence of model unsafety in actual emergency medicine.
- [Negative-class pilot (abstract)] The 60-run smoke test reports restraint criteria passing at 0.929 prevalence, which the authors correctly flag as gameable for training rewards. Given that the manuscript scaffolds a Megatron+SGLang+GRPO training loop, this observation directly affects whether the current rubric can be used as a training signal without additional safeguards or redesign.
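The gameability concern in the second comment can be made concrete with a prevalence audit: measure how often each criterion passes across a set of runs, then down-weight or drop the criteria that pass almost regardless of behaviour. The sketch below is one possible safeguard, not the authors' method. The function names, the per-run dictionary format, the 0.9 cut-off, and the `1 - prevalence` weighting are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def criterion_prevalence(runs: Iterable[Mapping[str, bool]]) -> dict[str, float]:
    """Pass rate per criterion across runs (each run maps criterion id -> passed)."""
    passes: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for run in runs:
        for criterion_id, passed in run.items():
            totals[criterion_id] += 1
            passes[criterion_id] += int(passed)
    return {cid: passes[cid] / totals[cid] for cid in totals}


def training_weights(
    prevalence: Mapping[str, float], threshold: float = 0.9
) -> dict[str, float]:
    """Exclude criteria at or above the threshold; reweight the rest.

    The threshold and the (1 - prevalence) weighting are illustrative
    assumptions, not values taken from the paper.
    """
    return {
        cid: 0.0 if p >= threshold else 1.0 - p
        for cid, p in prevalence.items()
    }
```

An evaluation harness can tolerate a criterion that nearly always passes, because it still catches the rare failure. A training reward built on the same criterion pays the policy for doing nothing, which is why the audit matters before any GRPO coupling.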
minor comments (2)
- [Abstract] The abstract states that six infrastructure bugs were fixed between v2 and v8 and reordered model rankings; a concise table listing the bugs and their quantitative impact on Pass@1 and safety-failure rates would improve reproducibility and transparency.
- [Results] Multi-step workflow results are reported as 1.0% and 0.0%; the exact number of such workflows, the precise definition of “multi-step,” and the per-step competence breakdown should be stated explicitly to allow readers to assess partial competence claims.
Simulated Author's Rebuttal
Thank you for the constructive review. We address each major comment below and indicate where the manuscript will be revised.
-
Referee: Rubric construction (abstract and §3–4): the 2,255 binary criteria (515 safety-critical) are derived internally from the 14 entity types and 24 MCP tools with no reported clinician review, inter-rater reliability, or correlation against real-world adverse events. Because the headline safety-failure rates (27.5%, 34.0%) and the multi-step collapse claim rest entirely on these criteria zeroing reward, the absence of external clinical grounding is load-bearing for interpreting the results as evidence of model unsafety in actual emergency medicine.
Authors: We agree this is a substantive limitation. The criteria were derived by enumerating safety-critical conditions from the FHIR R4 entity schemas and MCP tool contracts to ensure they are executable inside the simulator. We will add a dedicated limitations subsection in the revised manuscript that states the absence of clinician review or inter-rater reliability and describes a planned follow-up validation study with emergency physicians. We will also qualify the interpretation of the reported safety-failure rates to emphasize that they reflect performance against the current internal rubric rather than proven real-world harm.
Revision: partial
-
Referee: Negative-class pilot (abstract): the 60-run smoke test reports restraint criteria passing at 0.929 prevalence, which the authors correctly flag as gameable for training rewards. Given that the manuscript scaffolds a Megatron+SGLang+GRPO training loop, this observation directly affects whether the current rubric can be used as a training signal without additional safeguards or redesign.
Authors: We agree the high prevalence makes the rubric unsuitable for direct use as a training reward. The manuscript already notes that the signal 'is not drop-in training-safe' and defers training ablations to future work. In revision we will expand the negative-class discussion to recommend concrete safeguards (e.g., reweighting or exclusion of high-prevalence criteria) before any GRPO integration and will clarify that the current rubric is intended for evaluation only.
Revision: partial
- External clinician review, inter-rater reliability metrics, and correlation of the 2,255 criteria against real-world adverse events
Circularity Check
No significant circularity: empirical benchmark release with external evaluations
The paper releases HealthCraft as an RL safety environment and benchmark with 195 tasks, 2,255 binary criteria, and a dual-layer rubric on a FHIR R4 simulator. Central results consist of direct empirical measurements—Pass@1 rates, safety-failure rates, and multi-step collapse—obtained by running external models (Claude Opus 4.6, GPT-5.4) against the released harness. No derivation chain, equation, or fitted parameter reduces the reported metrics to the authors' inputs by construction. The single reference to Corecraft supplies only infrastructural scaffolding details and does not load-bear the performance claims, which remain independently verifiable through the public release. This is a standard self-contained benchmark paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- Safety-critical criteria thresholds
axioms (1)
- Domain assumption: the FHIR R4 world state with 14 entity types and 3,987 seed entities provides a realistic proxy for emergency medicine conditions.