First, do NOHARM: towards clinically safe large language models

Adam Rodman; Adi Badhwar; Advait Patil; Allen Shih; Anastasia Perez; Anup Agarwal; April S. Liang; Arjun K. Manrai; Arjun Rustagi; Arnold Milstein

arxiv: 2512.01241 · v3 · pith:IGCN6APJnew · submitted 2025-12-01 · 💻 cs.CY · cs.AI

David Wu, Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal

Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Vartan Pahalyants, Ernest Y. Lee, Allen Shih, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, Thomas A. Buckley, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Anastasia Perez, Austin J. Schoeffler, Mahbuba Tusty, Chase M. Walton, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Arjun K. Manrai, Adam Rodman, Jonathan H. Chen, Ethan Goh

keywords: harm, models, advice, clinical, medical, noharm, performance, physicians

Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a 1,100-task benchmark of primary care-to-specialist consultation cases to measure the frequency and severity of harm from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 28 LLMs, recommendations carried the potential for severe harm in up to 22.6% of cases, with errors of omission accounting for more than 80% of severe errors. In a randomized trial of 101 generalist physicians, human benchmark performance significantly improved with AI assistance, yet physicians remained far from realizing the potential of AI tools, frequently ignoring essential advice surfaced by AI. Safety performance tracked general-intelligence and medical-knowledge benchmarks across the full range of models but decoupled at the frontier. Despite strong performance on existing evaluations, widely used AI models can produce medical advice with the potential for severe harm at non-trivial rates, highlighting the importance of explicit measurement of clinical safety.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
cs.CL 2026-05 unverdicted novelty 6.0

CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.

Towards Conversational Medical AI with Eyes, Ears and a Voice
cs.AI 2026-05 conditional novelty 6.0

AI co-clinician is a multimodal conversational AI that uses live audio-visual data for real-time medical reasoning in simulated telemedicine, approaching primary care physicians in management plans and differentials b...

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
cs.AI 2026-04 unverdicted novelty 6.0

AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 pe...