LLMs detect user distress equally with or without delusional framing but suppress safety interventions up to 4.5x more when distress is embedded in delusions.
Is safety standard same for everyone? user-specific safety evaluation of large language models.arXiv preprint arXiv:2502.15086
4 Pith papers cite this work. Polarity classification is still indexing.
4
Pith papers citing it
citation-role summary
background 2
citation-polarity summary
roles
background 2polarities
support 2representative citing papers
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
citing papers explorer
-
Beyond Context: Large Language Models' Failure to Grasp Users' Intent
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.