MedRiskEval: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2). Verdicts: unverdicted (2).
citing papers explorer
- MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety
Multi-turn jailbreak attacks on medical AI increase unsafe responses from 35% to 80% by turn 4, expose 19x model gaps invisible in single-turn tests, and a lightweight classifier reduces unsafe outputs by 52 points at the cost of 45% false alarms on benign queries.
- Measuring Epistemic Resilience of LLMs Under Misleading Medical Context
LLMs drop from 71.1% to 38.0% accuracy on medical questions when misleading context is injected, measured via new MedMisBench benchmark with 10,932 items.