arxiv: 2605.21496 · v1 · pith:LCQVOZGEnew · submitted 2026-04-18 · 💻 cs.LG · cs.AI · cs.CL

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

Pith reviewed 2026-05-22 01:08 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords reinforcement learning · safety evaluation · emergency medicine · language models · FHIR · trajectory safety · clinical workflows · benchmark environment

The pith

HealthCraft reveals frontier language models failing most multi-step emergency medicine safety tasks in a new simulated environment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HealthCraft as the first public reinforcement learning environment built to reward trajectory-level safety in realistic emergency medicine conditions. It adapts a FHIR R4 world state containing 14 entity types and nearly 4,000 seed entities, exposes 24 tools, and applies a dual-layer rubric that immediately zeros reward on any safety-critical violation. Evaluation across 195 tasks shows Claude Opus 4.6 achieving 24.8 percent Pass@1 and GPT-5.4 achieving 12.6 percent, with safety-failure rates of 27.5 percent and 34.0 percent respectively. Performance drops to near zero on multi-step workflows that best approximate real clinical care, even when models handle isolated steps adequately. The work also notes that infrastructure changes reordered model rankings between versions and that the reward signal contains gameable elements unsuitable for direct training use.

Core claim

HealthCraft supplies 195 tasks across six categories graded against 2,255 binary criteria, 515 of them safety-critical, inside a deterministic FHIR R4 simulator. On this benchmark, Claude Opus 4.6 and GPT-5.4 record Pass@1 scores of 24.8 percent and 12.6 percent, with safety-failure rates of 27.5 percent and 34.0 percent. On multi-step workflows, Pass@1 collapses to 1.0 percent and 0.0 percent respectively.

What carries the argument

The argument rests on the dual-layer rubric, which assigns zero reward to any trajectory that violates one of the 515 safety-critical criteria, applied inside the FHIR R4 environment with its 24 exposed tools.
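The gating behavior of such a rubric can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `CriterionResult` structure and the `trajectory_reward` function are hypothetical names, and the second layer (reward equal to the fraction of criteria passed) is an assumption, since the text above only states that a safety-critical violation zeroes the reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    """Outcome of one binary rubric criterion (hypothetical structure)."""
    criterion_id: str
    passed: bool
    safety_critical: bool


def trajectory_reward(results: list[CriterionResult]) -> float:
    """Return 0.0 on any safety-critical failure, else the fraction of criteria passed."""
    if not results:
        return 0.0
    # Layer 1: safety gate. A single violated safety-critical criterion zeroes the reward.
    if any(r.safety_critical and not r.passed for r in results):
        return 0.0
    # Layer 2 (assumed): plain fraction of all binary criteria that passed.
    return sum(r.passed for r in results) / len(results)


# Illustrative criteria; the identifiers are made up for this example.
unsafe_run = [
    CriterionResult("checked_allergies", passed=False, safety_critical=True),
    CriterionResult("documented_vitals", passed=True, safety_critical=False),
]
print(trajectory_reward(unsafe_run))  # gated to zero despite one passing criterion
```

Under this sketch a trajectory that satisfies most criteria still scores zero once it trips the gate, which is one way a safety-failure rate can dominate the headline Pass@1 numbers.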

If this is right

  • Infrastructure fidelity directly affects measured model rankings, as six bugs fixed between pilot versions reordered which model appeared stronger.
  • The reward signal cannot be used as-is for training because restraint criteria pass at 0.929 prevalence, which makes them gameable.
  • Partial competence on single steps does not produce integrated multi-step success, indicating a specific failure mode under sustained clinical pressure.
  • The environment includes scaffolding for coupling to training loops such as Megatron plus SGLang plus GRPO.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The environment could serve as a testbed for targeted safety fine-tuning before any clinical deployment attempt.
  • Similar trajectory-level safety rubrics may be required in other high-stakes sequential domains such as critical infrastructure control.
  • The observed gap between step-wise and workflow performance points to a need for training methods that explicitly optimize long-horizon constraint satisfaction.
  • Public release under Apache 2.0 enables independent replication and extension by groups focused on medical AI alignment.

Load-bearing premise

The dual-layer rubric and its 2,255 binary criteria capture every safety violation that would matter in actual emergency medicine.

What would settle it

Taking model trajectories that pass HealthCraft and having emergency physicians review the same decisions in matched real or high-fidelity cases would show whether the simulated safety signal predicts actual clinical risk.

Figures

Figures reproduced from arXiv: 2605.21496 by Brandon Dent.

Figure 1: HealthCraft architecture. A frontier LLM agent issues MCP tool calls against a FHIR-R4 ...
Figure 2: Entity graph. Fourteen FHIR-R4 entity types with 3,987 entities at seed=42. OpenEM’s ...
Figure 3: V8 Pass@1 by task category with Wilson 95% CIs.
Figure 4: Pass@1 and mean reward across pilots v3–v8.
Figure 5: Per-task safety-gate dominance. Each point is one task. The band along the top-left (high ...
Original abstract

Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5-28.4] and GPT-5.4 at 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%. On multi-step workflows - the closest proxy to real emergency care - performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM-judge overlay bounds evaluator noise, and a 60-run negative-class smoke pilot shows the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training-reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
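The bracketed ranges after each Pass@1 figure read as confidence intervals, and Figure 3's caption names Wilson 95% CIs. The sketch below shows how a Wilson score interval is computed. The attempt counts are assumptions, not figures from the text: 585 attempts per model (195 tasks times 3 trials) with 145 and 74 passes were chosen because they are consistent with the reported percentages.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 195 tasks x 3 trials = 585 attempts per model (not stated in the text).
for name, passes in [("Claude Opus 4.6", 145), ("GPT-5.4", 74)]:
    low, high = wilson_interval(passes, 585)
    print(f"{name}: {passes / 585:.1%} [{low:.1%}-{high:.1%}]")
```

With these assumed counts the intervals should land on or very near the reported ones. Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and remains usable for rates near zero, such as the multi-step workflow results.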

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HealthCraft, the first public reinforcement-learning safety environment for emergency medicine. Adapted from Corecraft and built on a FHIR R4 simulator with 14 entity types, 3,987 seed entities, and 24 MCP tools, it defines a dual-layer rubric of 2,255 binary criteria (515 safety-critical) that zeroes reward on any safety-critical violation. The authors release 195 tasks (plus a 10-task negative-class extension) across six categories and report V8 empirical results on two frontier models: Claude Opus 4.6 achieves Pass@1 of 24.8% [21.5-28.4] with 27.5% safety failures, while GPT-5.4 reaches 12.6% [10.2-15.6] with 34.0% safety failures; multi-step workflow performance collapses to 1.0% and 0.0% respectively. The paper also documents six infrastructure bugs that reordered model rankings between pilots, provides a deterministic LLM-judge overlay, and scaffolds future coupling to a Megatron+SGLang+GRPO training loop while releasing all assets under Apache 2.0.

Significance. If the rubric proves clinically valid, HealthCraft would supply a valuable, reproducible benchmark for trajectory-level safety failures that static medical QA datasets miss, directly supporting safer deployment of LLMs in high-stakes emergency care. The explicit release of the full environment, tasks, rubrics, and harness, together with confidence intervals and acknowledgment of infrastructure effects on rankings, constitutes a concrete community resource that strengthens the work's utility beyond the reported numbers.

major comments (2)
  1. [Rubric construction (abstract and §3–4)] The 2,255 binary criteria (515 safety-critical) are derived internally from the 14 entity types and 24 MCP tools with no reported clinician review, inter-rater reliability, or correlation against real-world adverse events. Because the headline safety-failure rates (27.5%, 34.0%) and the multi-step collapse claim rest entirely on these criteria zeroing reward, the absence of external clinical grounding is load-bearing for interpreting the results as evidence of model unsafety in actual emergency medicine.
  2. [Negative-class pilot (abstract)] The 60-run smoke test reports restraint criteria passing at 0.929 prevalence, which the authors correctly flag as gameable for training rewards. Given that the manuscript scaffolds a Megatron+SGLang+GRPO training loop, this observation directly affects whether the current rubric can be used as a training signal without additional safeguards or redesign.
minor comments (2)
  1. [Abstract] The abstract states that six infrastructure bugs were fixed between v2 and v8 and reordered model rankings; a concise table listing the bugs and their quantitative impact on Pass@1 and safety-failure rates would improve reproducibility and transparency.
  2. [Results] Multi-step workflow results are reported as 1.0% and 0.0%; the exact number of such workflows, the precise definition of “multi-step,” and the per-step competence breakdown should be stated explicitly to allow readers to assess partial competence claims.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive review. We address each major comment below and indicate where the manuscript will be revised.

point-by-point responses
  1. Referee: Rubric construction (abstract and §3–4): the 2,255 binary criteria (515 safety-critical) are derived internally from the 14 entity types and 24 MCP tools with no reported clinician review, inter-rater reliability, or correlation against real-world adverse events. Because the headline safety-failure rates (27.5%, 34.0%) and the multi-step collapse claim rest entirely on these criteria zeroing reward, the absence of external clinical grounding is load-bearing for interpreting the results as evidence of model unsafety in actual emergency medicine.

    Authors: We agree this is a substantive limitation. The criteria were derived by enumerating safety-critical conditions from the FHIR R4 entity schemas and MCP tool contracts to ensure they are executable inside the simulator. We will add a dedicated limitations subsection in the revised manuscript that states the absence of clinician review or inter-rater reliability and describes a planned follow-up validation study with emergency physicians. We will also qualify the interpretation of the reported safety-failure rates to emphasize that they reflect performance against the current internal rubric rather than proven real-world harm.
    revision: partial

  2. Referee: Negative-class pilot (abstract): the 60-run smoke test reports restraint criteria passing at 0.929 prevalence, which the authors correctly flag as gameable for training rewards. Given that the manuscript scaffolds a Megatron+SGLang+GRPO training loop, this observation directly affects whether the current rubric can be used as a training signal without additional safeguards or redesign.

    Authors: We agree the high prevalence makes the rubric unsuitable for direct use as a training reward. The manuscript already notes that the signal 'is not drop-in training-safe' and defers training ablations to future work. In revision we will expand the negative-class discussion to recommend concrete safeguards (e.g., reweighting or exclusion of high-prevalence criteria) before any GRPO integration and will clarify that the current rubric is intended for evaluation only.
    revision: partial
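One of the safeguards mentioned in the response, exclusion of high-prevalence criteria, could look roughly like the sketch below. It is an illustration under stated assumptions rather than anything the paper describes: the function names, the per-run dictionary format mapping criterion IDs to pass/fail, and the 0.9 cutoff are all hypothetical.

```python
from collections import defaultdict


def criterion_prevalence(runs: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of runs in which each criterion passed."""
    passes: dict[str, int] = defaultdict(int)
    counts: dict[str, int] = defaultdict(int)
    for run in runs:
        for criterion_id, passed in run.items():
            counts[criterion_id] += 1
            passes[criterion_id] += int(passed)
    return {cid: passes[cid] / counts[cid] for cid in counts}


def trainable_criteria(runs: list[dict[str, bool]], max_prevalence: float = 0.9) -> set[str]:
    """Keep only criteria whose pass rate is at or below the (assumed) cutoff."""
    prevalence = criterion_prevalence(runs)
    return {cid for cid, rate in prevalence.items() if rate <= max_prevalence}


# Toy pilot data: the restraint-style criterion passes in every run, so it is dropped.
pilot_runs = [
    {"no_unneeded_intubation": True, "ordered_ecg": True},
    {"no_unneeded_intubation": True, "ordered_ecg": False},
    {"no_unneeded_intubation": True, "ordered_ecg": True},
    {"no_unneeded_intubation": True, "ordered_ecg": False},
]
print(sorted(trainable_criteria(pilot_runs)))
```

A criterion that nearly every run passes contributes almost no gradient signal and can be satisfied by inaction, which is why a filter or reweighting of this kind might be applied before a rubric is reused as a training reward.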

standing simulated objections not resolved
  • External clinician review, inter-rater reliability metrics, and correlation of the 2,255 criteria against real-world adverse events

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark release with external evaluations

full rationale

The paper releases HealthCraft as an RL safety environment and benchmark with 195 tasks, 2,255 binary criteria, and a dual-layer rubric on a FHIR R4 simulator. Central results consist of direct empirical measurements (Pass@1 rates, safety-failure rates, and the multi-step collapse) obtained by running external models (Claude Opus 4.6, GPT-5.4) against the released harness. No derivation chain, equation, or fitted parameter reduces the reported metrics to the authors' inputs by construction. The single reference to Corecraft supplies only infrastructural scaffolding details and is not load-bearing for the performance claims, which remain independently verifiable through the public release. This is a standard self-contained benchmark paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The benchmark depends on author-defined safety criteria and simulation fidelity rather than derived quantities; no free parameters are fitted to target results in the reported evaluations.

free parameters (1)
  • Safety-critical criteria thresholds
    The 515 safety-critical criteria out of 2,255 are defined by the authors to zero reward on violations; they are constructed rather than derived from external data.
axioms (1)
  • domain assumption: The FHIR R4 world state with 14 entity types and 3,987 seed entities provides a realistic proxy for emergency medicine conditions.
    Invoked to justify the environment's relevance to real clinical workflows.


