PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference

Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Alex Qiu, Jiayi Zhou, Kaile Wang, Boxun Li, Sirui Han, Yike Guo, Yaodong Yang · 2025 · DOI 10.18653/v1/2025.acl-long.1544

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

cs.CL · 2026-06-18 · unverdicted · novelty 5.0

Sequential DPO produces varied effects on prior preferences (partial degradation, stability, pair-level redistribution, or positive transfer) depending on objective relationships rather than uniform forgetting.

The Consensus Trap: Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation

cs.AI · 2026-02-11 · unverdicted · novelty 5.0

A literature review concludes that pursuing consensus in data annotation creates biased AI by dismissing subjective disagreements and enforcing geographic hegemony, and proposes mapping diversity instead.

Position: AI Safety Requires Effective Controllability

cs.AI · 2026-05-26 · unverdicted · novelty 4.0

Position paper claiming that AI safety requires explicit runtime controllability and introducing ControlBench to demonstrate gaps in existing alignment methods.

citing papers explorer

The Consensus Trap: Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation

cs.AI · 2026-02-11 · unverdicted · none · ref 144
