Revisiting the robust alignment of circuit breakers. CoRR, abs/2407.15902

Leo Schwinn, Simon Geisler · 2024 · arXiv 2407.15902

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Exploring and Developing a Pre-Model Safeguard with Draft Models

cs.CR · 2026-05-19 · unverdicted · novelty 6.0

A safeguard that uses speculative inference on small language models to produce draft responses for safety prediction, lowering false negatives in pre-model jailbreak detection.

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

cs.CR · 2026-05-04 · accept · novelty 6.0

JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

cs.CL · 2026-04-10 · unverdicted · novelty 6.0 · 2 refs

Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.

citing papers explorer

Showing 3 of 3 citing papers.

Exploring and Developing a Pre-Model Safeguard with Draft Models
cs.CR · 2026-05-19 · unverdicted · none · ref 48
A safeguard that uses speculative inference on small language models to produce draft responses for safety prediction, lowering false negatives in pre-model jailbreak detection.

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
cs.CR · 2026-05-04 · accept · none · ref 41
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
cs.CL · 2026-04-10 · unverdicted · none · ref 21 · 2 links
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
