Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
Jailbreakbench: An open robustness benchmark for jailbreaking large language models.Advances in Neural Information Processing Systems, 37:55005–55029
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.AI 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
citing papers explorer
-
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
-
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.