Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.
3 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (3)
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
Representative citing papers:
LLMs often fail to redirect health questions containing misconceptions, unlike clinicians, exposing safety gaps in patient-facing medical AI.
Reasoning models achieve only 2-11% higher accuracy than non-reasoning models when handling queries with false presuppositions, failing to challenge 26-42% of them and remaining sensitive to presupposition strength.
Citing papers explorer
- Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs
Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.
- MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication
LLMs often fail to redirect health questions containing misconceptions, unlike clinicians, exposing safety gaps in patient-facing medical AI.
- Evaluating Reasoning Models for Queries with Presuppositions
Reasoning models achieve only 2-11% higher accuracy than non-reasoning models when handling queries with false presuppositions, failing to challenge 26-42% of them and remaining sensitive to presupposition strength.