LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
Is safety standard same for everyone? User-specific safety evaluation of large language models. arXiv preprint arXiv:2502.15086,
3 Pith papers cite this work.
Citation roles: background 2
Citation polarities: support 2
Years: 2025 (3)
Representative citing papers:
- Beyond Context: Large Language Models' Failure to Grasp Users' Intent
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
- Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents