… and Kiekintveld, Christopher and Laszka, Aron
Cited by 1 Pith paper (field: cs.CR; year: 2026):
Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring
A behavioral monitoring technique using HTTP, lexical, and timing signals detects guardrail presence with 100% accuracy and distinguishes guardrail blocks from LLM rejections with 98% average F1 on unseen prompts.
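To make the three signal families concrete, here is a minimal, hypothetical Python sketch. It is not the citing paper's method. The refusal phrases, the feature names, the `Observation` type, the `guess_source` helper and the 0.5 s latency threshold are all illustrative assumptions. A hand-set rule also stands in for whatever model the paper actually uses. The sketch assumes that a guardrail which blocks a prompt before the LLM generates text tends to answer sooner or with a non-200 status. An LLM refusal, by contrast, arrives as a normal 200 response that contains refusal wording.

```python
import re
from dataclasses import dataclass

# Hypothetical refusal phrases (assumption, not taken from the paper).
REFUSAL_PATTERNS = [
    r"\bI can(?:'|no)t help with\b",
    r"\bI'm sorry\b",
    r"\bcontent policy\b",
]


@dataclass
class Observation:
    status_code: int   # HTTP status code of the API response
    body: str          # response text returned to the client
    latency_s: float   # seconds from sending the request to receiving the response


def extract_features(obs: Observation) -> dict:
    """Map one response to illustrative HTTP, lexical, and timing features."""
    refusal_hits = sum(
        1 for pat in REFUSAL_PATTERNS if re.search(pat, obs.body, re.IGNORECASE)
    )
    return {
        "http_non_200": int(obs.status_code != 200),  # HTTP signal
        "refusal_hits": refusal_hits,                 # lexical signal
        "body_length": len(obs.body),                 # lexical signal
        "latency_s": obs.latency_s,                   # timing signal
    }


def guess_source(obs: Observation, latency_threshold_s: float = 0.5) -> str:
    """Hypothetical rule: 'answered', 'guardrail_block', or 'llm_refusal'."""
    f = extract_features(obs)
    if f["refusal_hits"] == 0 and not f["http_non_200"]:
        return "answered"
    # Assumed intuition: a pre-generation block is fast or uses a non-200 status.
    if f["http_non_200"] or f["latency_s"] < latency_threshold_s:
        return "guardrail_block"
    return "llm_refusal"


if __name__ == "__main__":
    print(guess_source(Observation(403, "Request blocked by content policy.", 0.1)))
    print(guess_source(Observation(200, "I'm sorry, but I cannot help with that.", 3.0)))
```

A real monitor would learn these decision boundaries from labeled traffic rather than fixing them by hand. The rule above only shows how the HTTP, lexical and timing features might combine.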