Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes

Hu, X · 2024 · DOI 10.52202/079017-4011

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

cs.CR · 2026-06-26 · unverdicted · novelty 6.0

Jailbreak attacks suppress Adversarially Compromised Heads in early layers but leave Safety-Aligned Heads robust in mid-layers, enabling competitive detection from persistent activations.

citing papers explorer

Showing 1 of 1 citing paper.

Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models
cs.CR · 2026-06-26 · unverdicted · none · ref 17
Jailbreak attacks suppress Adversarially Compromised Heads in early layers but leave Safety-Aligned Heads robust in mid-layers, enabling competitive detection from persistent activations.
