LLMs display systematic, architecture-dependent gaps between their self-stated safety policies and observed behavior on harmful prompts, with absolute refusal claims frequently violated.
XSTest: A test suite for identifying exaggerated safety behaviours in large language models
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies
LLMs display systematic, architecture-dependent gaps between their self-stated safety policies and observed behavior on harmful prompts, with absolute refusal claims frequently violated.