Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models
Jailbreak attacks suppress Adversarially Compromised Heads in early layers but leave Safety-Aligned Heads robust in mid-layers, enabling competitive detection from persistent activations.