Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
Pku-saferlhf: Towards multi-level safety alignment for llms with human preference
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Cat-DPO applies per-category adaptive safety margins during direct preference optimization to reduce variance in safety across harm categories.
citing papers explorer
-
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
-
Cat-DPO: Category-Adaptive Safety Alignment
Cat-DPO applies per-category adaptive safety margins during direct preference optimization to reduce variance in safety across harm categories.