SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.
PKU- SafeRLHF: Towards multi-level safety alignment for LLMs with human preference
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
background 1polarities
background 1representative citing papers
Sequential DPO produces varied effects on prior preferences (partial degradation, stability, pair-level redistribution, or positive transfer) depending on objective relationships rather than uniform forgetting.
A literature review concludes that pursuing consensus in data annotation creates biased AI by dismissing subjective disagreements and enforcing geographic hegemony, and proposes mapping diversity instead.
Position paper claiming that AI safety requires explicit runtime controllability and introducing ControlBench to demonstrate gaps in existing alignment methods.
citing papers explorer
-
The Consensus Trap: Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation
A literature review concludes that pursuing consensus in data annotation creates biased AI by dismissing subjective disagreements and enforcing geographic hegemony, and proposes mapping diversity instead.