A novel evaluation benchmark for medical LLMs: illuminating safety and effectiveness in clinical domains. npj Digital Medicine

Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, et al · 2025

1 Pith paper cites this work. Polarity classification is still being indexed.

representative citing papers

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

cs.AI · 2026-04-29 · unverdicted · novelty 5.0

LLMs for robotic health attendant control violate safety rules in 54.4% of harmful scenarios on average, with proprietary models at 23.7% median violation versus 72.8% for open-weight models, indicating they are not yet safe for clinical use.

citing papers explorer

Showing 1 of 1 citing paper.

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control · cs.AI · 2026-04-29 · unverdicted · none · ref 15
