ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

Bingyu Zhu; Hui Xue; Jungang Lou; Kui Ren; Longtao Huang; Ningyu Zhang; Yan Wang; Yuefeng Chen; Zeyu Yang; Zhen Bi


Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize harmful intent in its reasoning but still predict a safe label, or issue an unsafe decision without policy-grounded justification. We identify this safety-critical failure mode as the deliberation-to-enforcement gap. Unlike general chain-of-thought faithfulness, guardrail reliability requires policy execution consistency: the generated reasoning should be grounded in the safety policy, and the final decision should be entailed by that reasoning. We propose ConsisGuard, a consistency-aware framework for reasoning-based LLM guardrails. ConsisGuard performs Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment, aligning the internal coupling between safety deliberation and decision enforcement. Experiments on prompt and response harmfulness detection benchmarks show that ConsisGuard improves detection performance while reducing policy execution failures. These results suggest that reliable reasoning-based guardrails require accurate and faithful execution of safety policies.
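To make the two failure cases named in the abstract concrete, the following is a minimal illustrative sketch, not the paper's method or metric. It assumes each guardrail output has already been reduced to three hypothetical fields (`reasoning_flags_harm`, `cites_policy`, `final_label`); in practice these would have to be extracted from the rationale by some annotation or judging step that the abstract does not describe. All names here are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GuardrailOutput:
    """Hypothetical, simplified view of one guardrail prediction."""
    reasoning_flags_harm: bool  # rationale recognizes harmful intent
    cites_policy: bool          # rationale is grounded in the safety policy
    final_label: str            # "safe" or "unsafe"


def classify_gap(out: GuardrailOutput) -> str:
    """Label an output with one of the two failure cases, or 'consistent'."""
    # Case 1: harm is recognized in the reasoning, yet the label is safe.
    if out.reasoning_flags_harm and out.final_label == "safe":
        return "harm_recognized_but_allowed"
    # Case 2: an unsafe decision with no policy-grounded justification.
    if out.final_label == "unsafe" and not out.cites_policy:
        return "unsafe_without_policy_grounding"
    return "consistent"


def gap_rate(outputs: List[GuardrailOutput]) -> float:
    """Fraction of outputs that fall into either failure case."""
    if not outputs:
        return 0.0
    failures = sum(1 for o in outputs if classify_gap(o) != "consistent")
    return failures / len(outputs)


examples = [
    GuardrailOutput(True, True, "safe"),     # recognized harm, still allowed
    GuardrailOutput(False, False, "unsafe"), # blocked without grounding
    GuardrailOutput(True, True, "unsafe"),   # consistent block
    GuardrailOutput(False, False, "safe"),   # consistent allow
]
print([classify_gap(o) for o in examples])
print(gap_rate(examples))
```

A rate of this kind only counts surface-level disagreements between rationale and label; it says nothing about whether the reasoning itself is correct, which is why it should be read as a diagnostic rather than as a measure of guardrail quality.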

