Multi-turn jailbreak attacks on medical AI increase unsafe responses from 35% to 80% by turn 4, expose 19x model gaps invisible in single-turn tests, and a lightweight classifier reduces unsafe outputs by 52 points at the cost of 45% false alarms on benign queries.
MedSafetyBench : Evaluating and Improving the Medical Safety of Large Language Models
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
DrugBench evaluates AI control protocols on 3,671 medical conversations for four medication harm types and finds existing protocols subvertible, proposing severity-based monitoring instead.
citing papers explorer
-
DrugBench: Evaluating AI Control Protocols for Medication Harm Mitigation
DrugBench evaluates AI control protocols on 3,671 medical conversations for four medication harm types and finds existing protocols subvertible, proposing severity-based monitoring instead.