Healthcare agent: eliciting the power of large language models for medical consultation
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Automatic Replication of LLM Mistakes in Medical Conversations
MedMistake automatically generates 3,390 single-shot QA pairs capturing LLM mistakes in medical conversations, with expert validation on a 211-question subset showing performance differences among 12 frontier models.