Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

2026 · cs.CL · arXiv 2604.09189

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.
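The comparison in step (3) can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a rule is a mapping from harm category to one of the three predicate types named in the abstract, that each observation records only whether the model refused, and that an Absolute rule is violated by any compliance. The names `RuleType`, `Observation`, and `absolute_violation_rate`, along with the example categories, are hypothetical. Conditional and Adaptive rules are left unscored because the abstract does not define their semantics.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RuleType(Enum):
    # Predicate types named in the abstract; their exact semantics are assumed here.
    ABSOLUTE = "absolute"
    CONDITIONAL = "conditional"
    ADAPTIVE = "adaptive"


@dataclass(frozen=True)
class Observation:
    category: str   # harm category of the prompt (hypothetical labels)
    refused: bool   # True if the model refused the prompt


def absolute_violation_rate(
    rules: dict[str, RuleType], observations: list[Observation]
) -> Optional[float]:
    """Fraction of observations in self-declared ABSOLUTE categories where the model complied.

    Returns None when no observation falls under an ABSOLUTE rule.
    """
    relevant = [o for o in observations if rules.get(o.category) is RuleType.ABSOLUTE]
    if not relevant:
        return None
    violations = sum(1 for o in relevant if not o.refused)
    return violations / len(relevant)


# Hypothetical self-stated policy and observed behavior.
rules = {"weapons": RuleType.ABSOLUTE, "medical": RuleType.CONDITIONAL}
observations = [
    Observation("weapons", refused=True),
    Observation("weapons", refused=False),   # complied despite an absolute rule
    Observation("medical", refused=False),   # not scored: rule is conditional
    Observation("unstated", refused=False),  # not scored: no rule articulated
]
rate = absolute_violation_rate(rules, observations)
```

Because the check is a plain lookup and count, it is deterministic given the extracted rules and the refusal labels. Categories with no articulated rule drop out of the denominator instead of counting as consistent.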

representative citing papers

Self-CTRL: Self-Consistency Training with Reinforcement Learning

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Self-CTRL uses RL to align LM self-explanations with behavior, boosting bias correlation to R²=0.64 and refusal prediction accuracy to 92% while cutting harm failures to 0.5%.

citing papers explorer


Self-CTRL: Self-Consistency Training with Reinforcement Learning

cs.LG · 2026-06-16 · unverdicted · none · ref 30 · internal anchor
