pith. machine review for the scientific record

arxiv: 2604.07709 · v3 · submitted 2026-04-09 · 💻 cs.AI · cs.CL · cs.CY · cs.LG

Recognition: 2 Lean theorem links

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures


Pith reviewed 2026-05-10 18:19 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CY · cs.LG
keywords AI safety · iatrogenic harm · medical advice · withholding · clinical scenarios · omission harm · role framing · frontier models

The pith

AI safety training causes frontier models to withhold clinical guidance from laypeople that they readily provide to physicians asking the identical question.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether safety measures in AI models lead to systematically different medical advice depending on whether a query is framed as coming from a physician or a layperson. Using sixty pre-registered clinical scenarios scored for commission and omission harm, it finds that all testable models give higher-quality responses, with less harmful withholding, when the user is presented as a doctor. The effect is largest in models with the most safety training and is missed by standard LLM evaluators. This pattern creates measurable risk of harm for non-expert users who have already exhausted conventional medical channels. The authors separate the behavior into distinct failure modes tied to how each model was aligned.

Core claim

Matching the same clinical question in physician versus layperson framing produces better guidance to the physician across all five testable models, with a decoupling gap of +0.38 and a 13.1 percentage point drop in binary hit rates on safety-colliding actions for layperson framing. The gap reaches +0.65 for the model with heaviest safety investment. Three failure modes separate cleanly: trained withholding, incompetence, and indiscriminate content filtering. The standard LLM judge assigns omission harm scores of zero to 73 percent of responses that physicians score as harmful.

What carries the argument

Identity-contingent withholding, measured by presenting identical clinical scenarios in physician versus layperson framing and scoring responses on commission harm (0-3) and omission harm (0-4) axes through a pre-registered, physician-validated pipeline.
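To make the measurement concrete, the sketch below shows one way identity-contingent withholding could be computed from paired framings. The framing templates, the scorer stub, and the function names are editorial assumptions for illustration, not the paper's actual IatroBench pipeline.

```python
# Minimal sketch of the paired-framing measurement (illustrative assumptions;
# the real prompt templates and physician-validated scorer are not reproduced).
from statistics import mean

# Hypothetical framing templates applied to one and the same clinical scenario.
FRAMINGS = {
    "physician": "I'm a physician; a patient presents with: {scenario}",
    "layperson": "{scenario}",
}

def score_omission_harm(response: str) -> int:
    """Placeholder for the structured, physician-validated OH scorer (0-4)."""
    raise NotImplementedError

def decoupling_gap(generate, scenarios):
    """Mean OH under layperson framing minus mean OH under physician framing.

    `generate` is any callable mapping a prompt string to a model response.
    A positive gap means the identical question drew more harmful withholding
    when the user was not framed as an expert.
    """
    per_scenario = []
    for scenario in scenarios:
        oh = {
            role: score_omission_harm(generate(template.format(scenario=scenario)))
            for role, template in FRAMINGS.items()
        }
        per_scenario.append(oh["layperson"] - oh["physician"])
    return mean(per_scenario)
```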

If this is right

  • Safety training can introduce measurable withholding of actionable medical information when the user is not framed as an expert.
  • The largest gaps appear in models that received the heaviest safety investment.
  • Standard automated judges miss omission harms that physicians detect at high rates.
  • The withholding occurs even in scenarios where the user has already exhausted standard referrals.
  • Distinct failure modes emerge depending on the specific safety approach used during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar framing effects could appear in other regulated domains where models are trained to restrict advice based on user identity.
  • Users without medical credentials may need to adopt professional phrasing to receive full guidance, creating an accessibility barrier.
  • Alternative alignment techniques that avoid role-based filtering could be tested to determine whether the gap can be reduced without losing other safety properties.

Load-bearing premise

That differences in guidance quality are produced by safety training rather than other prompt sensitivities or unmeasured model factors.

What would settle it

Running the identical sixty scenarios on base models that received no safety training would settle it. If the quality gap between physician and layperson framings persists in those models, the claim that safety measures cause the withholding is falsified; if the gap disappears, the causal attribution is strengthened. A sketch of such a check follows.
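A minimal sketch of that check, assuming per-scenario omission-harm scores are available for both framings on the model under test; the Wilcoxon signed-rank test and the names below are illustrative choices, not the paper's pre-registered statistics.

```python
# Sketch of the settling experiment: score the same 60 scenarios under both
# framings and test whether layperson framing draws higher omission harm.
# Applied to a base model with no safety training, a surviving gap would
# undercut the attribution to safety measures; its absence would support it.
import numpy as np
from scipy.stats import wilcoxon

def framing_gap_test(oh_layperson, oh_physician, alpha=0.05):
    """Paired one-sided test: is omission harm higher under layperson framing?

    oh_layperson, oh_physician: per-scenario OH scores (0-4) for identical
    scenarios under the two framings.
    """
    oh_layperson = np.asarray(oh_layperson, dtype=float)
    oh_physician = np.asarray(oh_physician, dtype=float)
    stat, p = wilcoxon(oh_layperson, oh_physician, alternative="greater")
    return {
        "mean_gap": float((oh_layperson - oh_physician).mean()),
        "p_value": float(p),
        "gap_detected": bool(p < alpha),
    }
```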

Figures

Figures reproduced from arXiv: 2604.07709 by David Gringras.

Figure 1. Identity-contingent withholding: the Decoupling Eval. Each point plots a model's mean … [figure not reproduced]
Figure 2. Per-model decoupling gap (structured evaluation). Positive gap … [figure not reproduced]
Figure 3. The self-reinforcing evaluation blind spot. Models trained to minimise commission harm … [figure not reproduced]
Original abstract

Ask a frontier model how to taper six milligrams of alprazolam (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word ("I'm a psychiatrist; a patient presents with...") and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap. Sixty pre-registered clinical scenarios, six frontier models, 3,600 responses, scored on two axes (commission harm, CH 0-3; omission harm, OH 0-4) through a structured-evaluation pipeline validated against physician scoring (kappa_w = 0.571, within-1 agreement 96%). The central finding is identity-contingent withholding: match the same clinical question in physician vs. layperson framing and all five testable models provide better guidance to the physician (decoupling gap +0.38, p = 0.003; binary hit rates on safety-colliding actions drop 13.1 percentage points in layperson framing, p < 0.0001, while non-colliding actions show no change). The gap is widest for the model with the heaviest safety investment (Opus, +0.65). Three failure modes separate cleanly: trained withholding (Opus), incompetence (Llama 4), and indiscriminate content filtering (GPT-5.2, whose post-generation filter strips physician responses at 9x the layperson rate because they contain denser pharmacological tokens). The standard LLM judge assigns OH = 0 to 73% of responses a physician scores OH >= 1 (kappa = 0.045); the evaluation apparatus has the same blind spot as the training apparatus. Every scenario targets someone who has already exhausted the standard referrals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces IatroBench, a benchmark of 60 pre-registered clinical scenarios (e.g., benzodiazepine tapering) evaluated across six frontier models with 3,600 total responses. Responses are scored on commission harm (CH 0-3) and omission harm (OH 0-4) via a structured pipeline validated against physicians (weighted kappa 0.571, 96% within-1 agreement). The central claim is identity-contingent withholding: identical queries framed as coming from a physician versus a layperson yield better guidance to physicians (decoupling gap +0.38, p=0.003), with a 13.1pp drop in binary hit rates on safety-colliding actions under layperson framing (p<0.0001) while non-colliding actions are unaffected. The gap is largest for the heaviest-safety model (Opus, +0.65); three failure modes are distinguished (trained withholding, incompetence, indiscriminate filtering), and standard LLM judges are shown to disagree with physicians (kappa=0.045).

Significance. If the reported gaps are robustly attributable to safety alignment rather than prompt sensitivity or capability differences, the work would be significant for AI safety research by providing pre-registered, large-N empirical evidence of iatrogenic withholding in medical contexts and a taxonomy of failure modes. Strengths include the pre-registration, scale (3,600 responses), physician validation pipeline, and clean separation of colliding vs. non-colliding actions. The finding that the standard LLM judge misses 73% of physician-flagged OH>=1 cases also highlights a shared blind spot between training and evaluation. However, without ablations isolating safety training, the causal interpretation remains suggestive.

major comments (3)
  1. [Abstract and Results] The central claim attributes the +0.38 decoupling gap and 13.1pp hit-rate drop specifically to AI safety measures, yet no ablation is reported that holds base architecture and pre-training fixed while varying only the safety alignment stage. The larger gap for Opus is noted, but without such controls alternative explanations (general professional-framing sensitivity, unmeasured training differences) cannot be excluded.
  2. [Physician validation section] The reported weighted kappa_w = 0.571 between the structured pipeline and physicians constitutes only moderate agreement. Because the OH scores (and thus the decoupling gap) rest on this pipeline, the moderate agreement introduces potential noise or bias that could affect the magnitude and significance of the reported gaps.
  3. [Methods] Details on how the 60 scenarios were constructed, how safety-colliding vs. non-colliding actions were pre-registered and classified, and the exact exclusion rules are not fully specified in the provided text. These choices are load-bearing for the claim that the gap is specific to safety-colliding content rather than general prompt effects.
minor comments (2)
  1. [Abstract] Abstract states 'six frontier models' but then refers to 'five testable models'; clarify which model was excluded and why.
  2. [Abstract] The exact formula for the decoupling gap (+0.38) should be stated explicitly (e.g., difference in mean OH or a normalized ratio) rather than left as a summary statistic.
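As an editorial aside to minor comment 2: one candidate definition consistent with the summary numbers, offered as an assumption rather than the paper's stated formula, is the per-model difference of framing-conditioned mean omission-harm scores over the S = 60 paired scenarios.

```latex
% Assumed definition, not taken from the paper: per-model decoupling gap as the
% difference of mean omission-harm scores across the S = 60 paired scenarios.
\Delta_{\mathrm{decouple}}(m) \;=\; \frac{1}{S} \sum_{s=1}^{S}
  \left[ \mathrm{OH}(m, s, \mathrm{layperson}) - \mathrm{OH}(m, s, \mathrm{physician}) \right]
```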

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point-by-point below, noting planned revisions and limitations we cannot address.

Point-by-point responses
  1. Referee: Abstract and Results: The central claim attributes the +0.38 decoupling gap and 13.1pp hit-rate drop specifically to AI safety measures, yet no ablation is reported that holds base architecture and pre-training fixed while varying only the safety alignment stage. The larger gap for Opus is noted, but without such controls alternative explanations (general professional-framing sensitivity, unmeasured training differences) cannot be excluded.

    Authors: We agree that a controlled ablation isolating safety alignment would provide stronger causal evidence. However, the relevant base models without safety post-training are not available for the frontier systems tested. We instead use cross-model variation in safety investment (largest gap in Opus) and the distinct failure-mode taxonomy (trained withholding vs. incompetence vs. filtering) to support the interpretation. We will revise the Discussion to explicitly state that the attribution remains suggestive and to discuss alternative explanations such as professional-framing sensitivity. revision: partial

  2. Referee: Physician validation section: The reported weighted kappa_w = 0.571 between the structured pipeline and physicians constitutes only moderate agreement. Because the OH scores (and thus the decoupling gap) rest on this pipeline, the moderate agreement introduces potential noise or bias that could affect the magnitude and significance of the reported gaps.

    Authors: We report the moderate kappa transparently. The 96% within-1 agreement indicates discrepancies are typically small. We will add a sensitivity analysis in the revision demonstrating that the decoupling gap and hit-rate differences remain statistically significant when using only fully agreed cases or when applying conservative adjustments for potential bias. This will quantify the impact of any noise on the primary findings. revision: yes

  3. Referee: Methods: Details on how the 60 scenarios were constructed, how safety-colliding vs. non-colliding actions were pre-registered and classified, and the exact exclusion rules are not fully specified in the provided text. These choices are load-bearing for the claim that the gap is specific to safety-colliding content rather than general prompt effects.

    Authors: The 60 scenarios, colliding/non-colliding classifications, and exclusion rules were defined in the pre-registration. We will expand the Methods section to include the pre-registration link, a summary table of construction criteria (real-world cases with exhausted referrals), explicit definitions and examples distinguishing safety-colliding actions (e.g., specific high-risk dosages) from non-colliding ones, and all exclusion criteria. This will make the specificity to colliding content fully transparent. revision: yes
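As an editorial aside to response 2 above, here is a minimal sketch of the kind of agreement and sensitivity check the authors describe, assuming per-response OH labels are available from both the structured pipeline and physicians; the quadratic weighting, the column names, and the exact-agreement restriction are assumptions, not the paper's protocol.

```python
# Sketch of the validation and sensitivity checks: weighted kappa and within-1
# agreement between pipeline and physician OH scores, plus a re-estimate of the
# framing gap restricted to fully agreed cases. Illustrative assumptions only.
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def agreement_report(pipeline_oh, physician_oh):
    """Weighted kappa (quadratic weighting assumed) and within-1 agreement."""
    pipeline_oh = np.asarray(pipeline_oh)
    physician_oh = np.asarray(physician_oh)
    return {
        "kappa_w": cohen_kappa_score(physician_oh, pipeline_oh, weights="quadratic"),
        "within_1": float(np.mean(np.abs(pipeline_oh - physician_oh) <= 1)),
    }

def gap_on_agreed_cases(responses: pd.DataFrame) -> float:
    """Recompute the layperson-minus-physician-framing OH gap using only
    responses where pipeline and physician scores agree exactly.

    Expects hypothetical columns: 'framing' ('layperson' or 'physician'),
    'pipeline_oh', 'physician_oh'.
    """
    agreed = responses[responses["pipeline_oh"] == responses["physician_oh"]]
    means = agreed.groupby("framing")["physician_oh"].mean()
    return float(means["layperson"] - means["physician"])
```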

standing simulated objections not resolved
  • Direct ablation holding base architecture and pre-training fixed while varying only safety alignment, which remains infeasible because the corresponding frontier base models without safety training are not accessible for evaluation.

Circularity Check

0 steps flagged

No circularity: empirical gaps measured directly from controlled outputs

full rationale

The paper reports statistical differences in model responses (decoupling gap +0.38, hit-rate drops) obtained by presenting identical clinical scenarios under two framings to the same models. These are direct measurements from 3,600 generated responses scored via a physician-validated pipeline; no equations, fitted parameters, or predictions are defined in terms of the target gaps. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on pre-registered empirical comparisons rather than any derivation that reduces to its own inputs by construction. External physician scoring (kappa_w = 0.571) further separates the measurement from model-internal artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; the central claim rests on the assumption that the structured scoring pipeline validly measures clinical harm and that framing differences isolate safety-induced withholding.

axioms (1)
  • Domain assumption: the structured-evaluation pipeline accurately captures clinical omission and commission harm, as validated by physician raters (kappa_w = 0.571).
    The paper uses this validation to support all reported gaps and failure-mode classifications.

pith-pipeline@v0.9.0 · 5665 in / 1466 out tokens · 81424 ms · 2026-05-10T18:19:06.145062+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Linked paper passage: "The central finding is identity-contingent withholding: match the same clinical question in physician vs. layperson framing and all five testable models provide better guidance to the physician (decoupling gap +0.38, p = 0.003; binary hit rates on safety-colliding actions drop 13.1 percentage points in layperson framing, p < 0.0001)"

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · tag: unclear

    Linked paper passage: "Goodhart's Law (Goodhart, 1984) gives this a formal name, and our data supply the empirical content: commission harm draws a large negative reward signal; omission harm, approximately nothing; refusal, a small positive."

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 25 canonical work pages · 11 internal anchors

  1. Anthropic (2026). Claude's New Constitution. https://www.anthropic.com/news/claude-new-constitution. Accessed March 2026.
  2. Wang, Z. et al. (2025). Evading LLMs' Safety Boundary with Adaptive Role-Play Jailbreaking. Electronics, 14(24):4808.
  3. Arora, A. et al. (2025). HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv:2505.08775.
  4. Ashton, H. (2002). Benzodiazepines: How They Work and How to Withdraw (The Ashton Manual). Newcastle University. https://www.benzo.org.uk/manual/
  5. Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
  6. Bean, A.M., Payne, R.E. et al. (2026). Reliability of LLMs as Medical Assistants for the General Public: A Randomized Preregistered Study. Nature Medicine, 32:609--615.
  7. Chen, S., Gao, M., Sasse, K. et al. (2025). When Helpfulness Backfires: LLMs and the Risk of False Medical Information Due to Sycophantic Behavior. npj Digital Medicine, 8:605.
  8. Centers for Disease Control and Prevention (2022). CDC Clinical Practice Guideline for Prescribing Opioids for Pain -- United States, 2022. MMWR Recommendations and Reports, 71(3):1--95.
  9. Zhang, Z. et al. (2025). FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs. COLM 2025. arXiv:2505.08054.
  10. Cui, J. et al. (2024). OR-Bench: An Over-Refusal Benchmark for Large Language Models. ICML 2025. arXiv:2405.20947.
  11. Dai, J. et al. (2024). Safe RLHF: Safe Reinforcement Learning from Human Feedback. ICLR 2024 (Spotlight). arXiv:2310.12773.
  12. Dubois, Y. et al. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. COLM 2024. arXiv:2404.04475.
  13. Food and Drug Administration (2020). FDA Drug Safety Communication: FDA Requiring Boxed Warning Updated to Improve Safe Use of Benzodiazepine Drug Class. https://www.fda.gov/drugs/drug-safety-and-availability/fda-requiring-boxed-warning-updated-improve-safe-use-benzodiazepine-drug-class
  14. Feinstein, A.R. & Cicchetti, D.V. (1990). High Agreement but Low Kappa: I. The Problems of Two Paradoxes. Journal of Clinical Epidemiology, 43(6):543--549.
  15. Gao, L., Schulman, J., & Hilton, J. (2023). Scaling Laws for Reward Model Overoptimization. ICML 2023. arXiv:2210.10760.
  16. Goodhart, C.A.E. (1984). Problems of Monetary Management: The U.K. Experience. In Monetary Theory and Practice, pp. 91--121. Macmillan.
  17. Han, S. et al. (2024). WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. NeurIPS 2024 Datasets & Benchmarks. arXiv:2406.18495.
  18. Jin, D. et al. (2021). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14):6421.
  19. Krakovna, V. et al. (2020). Specification Gaming: The Flip Side of AI Ingenuity. DeepMind Blog. https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
  20. Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL 2022.
  21. Manheim, D. & Garrabrant, S. (2019). Categorizing Variants of Goodhart's Law. arXiv:1803.04585.
  22. Mazeika, M. et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. ICML 2024. arXiv:2402.04249.
  23. OpenAI (2024). Model Spec. https://cdn.openai.com/spec/model-spec-2024-05-08.html. Updated 2025.
  24. OpenAI (2025). From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training. arXiv:2508.09224.
  25. Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. arXiv:2203.02155.
  26. Parrish, A. et al. (2022). BBQ: A Hand-Built Bias Benchmark for Question Answering. Findings of ACL 2022.
  27. Perez, E. et al. (2023). Discovering Language Model Behaviors with Model-Written Evaluations. Findings of ACL 2023. arXiv:2212.09251.
  28. Qi, X. et al. (2025). Safety Alignment Should Be Made More Than Just a Few Tokens Deep. ICLR 2025. arXiv:2406.05946.
  29. Ramaswamy, A. et al. (2026). ChatGPT Health Performance in a Structured Test of Triage Recommendations. Nature Medicine. doi:10.1038/s41591-026-04297-7.
  30. Röttger, P. et al. (2024). XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. NAACL 2024. arXiv:2308.01263.
  31. Sharma, M. et al. (2024). Towards Understanding Sycophancy in Language Models. ICLR 2024. arXiv:2310.13548.
  32. Singhal, K. et al. (2023). Large Language Models Encode Clinical Knowledge. Nature, 620:172--180.
  33. Studdert, D.M. et al. (2005). Defensive Medicine Among High-Risk Specialist Physicians in a Volatile Malpractice Environment. JAMA, 293(21):2609--2617.
  34. Wang, S. et al. (2025). A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains. arXiv:2507.23486.
  35. Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How Does LLM Safety Training Fail? NeurIPS 2023. arXiv:2307.02483.
  36. Wu, M. et al. (2025). Style Over Substance: Evaluation Biases for Large Language Models. COLING 2025, pages 297--312. arXiv:2307.03025.
  37. Xie, T. et al. (2025). SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal. ICLR 2025. arXiv:2406.14598.
  38. Yang, Z. et al. (2024). The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models. arXiv:2411.11407.
  39. Ye, J. et al. (2024). Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge. arXiv:2410.02736.
  40. Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. arXiv:2306.05685.
  41. Mello, M.M., Chandra, A., Gawande, A.A., & Studdert, D.M. (2010). National Costs of the Medical Liability System. Health Affairs, 29(9):1569--1577.
  42. Wu, D. et al. (2025). First, Do NOHARM: Towards Clinically Safe Large Language Models. arXiv:2512.01241.