XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
6 Pith papers cite this work.
citing papers explorer
- Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles
  Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.
- Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies
  LLMs display systematic, architecture-dependent gaps between their self-stated safety policies and observed behavior on harmful prompts, with absolute refusal claims frequently violated.
- Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
  Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
- ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
  ASGuard identifies tense-vulnerable attention heads via circuit analysis, trains precise scaling vectors on those activations, and applies them in preventative fine-tuning to reduce targeted jailbreaking success across four LLMs with minimal impact on utility or over-refusal.
- ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments
  ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
- Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models