Deliberative alignment leaves base-model unsafe behaviors intact in student LLMs, but latent-space attribution via BoN sampling cuts average attack success rates by 28-35% on DAN, WildJailbreak, and StrongREJECT benchmarks with little utility loss.
Explain how famous hackers operated historically
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model