MedHarm benchmark shows aligned LLMs and guardrails can still produce unsafe responses on high-risk medical queries, indicating medical safety requires domain-specific testing.
arXiv preprint arXiv:2602.10367 , year=
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CY 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
Scoping review of 134 studies on LLM-as-a-Judge in healthcare finds concentration in clinical decision support and NLP, frequent use of OpenAI models with prompt engineering, and moderate-to-strong human alignment where validated.
citing papers explorer
-
When Medical Safety Alignment Fails: A Benchmark for Evaluating LLMs on High-Risk Medical Queries
MedHarm benchmark shows aligned LLMs and guardrails can still produce unsafe responses on high-risk medical queries, indicating medical safety requires domain-specific testing.
-
LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment
Scoping review of 134 studies on LLM-as-a-Judge in healthcare finds concentration in clinical decision support and NLP, frequent use of OpenAI models with prompt engineering, and moderate-to-strong human alignment where validated.