XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

6 Pith papers cite this work. Polarity classification is still indexing.

Citing papers by year: 2026 (4), 2025 (2)

representative citing papers

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

cs.AI · 2025-09-30 · unverdicted · novelty 6.0

ASGuard identifies tense-vulnerable attention heads via circuit analysis, trains precise scaling vectors on those activations, and applies them in preventative fine-tuning to reduce targeted jailbreaking success across four LLMs with minimal impact on utility or over-refusal.
