Deliberative alignment leaves base-model unsafe behaviors intact in student LLMs, but latent-space attribution via BoN sampling cuts average attack success rates by 28-35% on DAN, WildJailbreak, and StrongREJECT benchmarks with little utility loss.
Misinformation / Disinformation
### Policy Objective
Mitigate the spread of false or misleading content
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model