Is safety standard same for everyone? User-specific safety evaluation of large language models. arXiv preprint arXiv:2502.15086
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
background 2
citation-polarity summary
support 2
representative citing papers
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
citing papers explorer
- Lost in Delusion: Examining LLM Safety Under User Delusions and Distress
  LLMs detect user distress equally with or without delusional framing but suppress safety interventions up to 4.5x more when distress is embedded in delusions.
- Beyond Context: Large Language Models' Failure to Grasp Users' Intent
  LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
- LLM Harms: A Taxonomy and Discussion
- Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents