XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy · 2024 · DOI 10.18653/v1/2024.naacl-long.301

6 Pith papers cite this work. Polarity classification is still indexing.



representative citing papers

Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

cs.CY · 2026-04-22 · unverdicted · novelty 6.0

Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.

Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

cs.CL · 2026-04-10 · unverdicted · novelty 6.0

LLMs display systematic, architecture-dependent gaps between their self-stated safety policies and observed behavior on harmful prompts, with absolute refusal claims frequently violated.

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

cs.AI · 2026-04-03 · unverdicted · novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

cs.AI · 2025-09-30 · unverdicted · novelty 6.0

ASGuard identifies tense-vulnerable attention heads via circuit analysis, trains precise scaling vectors on those activations, and applies them in preventative fine-tuning to reduce targeted jailbreaking success across four LLMs with minimal impact on utility or over-refusal.

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

cs.CL · 2025-08-06 · unverdicted · novelty 6.0

ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07 · unreviewed

citing papers explorer


Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles · cs.CY · 2026-04-22 · unverdicted · none · ref 22
Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies · cs.CL · 2026-04-10 · unverdicted · none · ref 5
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules · cs.AI · 2026-04-03 · unverdicted · none · ref 30
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack · cs.AI · 2025-09-30 · unverdicted · none · ref 4
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments · cs.CL · 2025-08-06 · unverdicted · none · ref 24
Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models · cs.AI · 2026-05-07 · unreviewed · ref 18
