VLAF diagnostics show alignment faking is widespread in LLMs as small as 7B parameters, driven by consistent activation shifts that can be mitigated with contrastive steering vectors reducing faking by 58-94%.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3representative citing papers
RLVR can be backdoored with under 2% poisoned data using an asymmetric reward trigger, implanting jailbreaks that cut safety performance by 73% on average without harming benign tasks.
Step-wise detection via a contrastive safety direction followed by remasking and adaptive steering reduces jailbreak success rates in diffusion language models to 0.64% while preserving output quality.
citing papers explorer
-
Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models
VLAF diagnostics show alignment faking is widespread in LLMs as small as 7B parameters, driven by consistent activation shifts that can be mitigated with contrastive steering vectors reducing faking by 58-94%.
-
Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward
RLVR can be backdoored with under 2% poisoned data using an asymmetric reward trigger, implanting jailbreaks that cut safety performance by 73% on average without harming benign tasks.
-
Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models
Step-wise detection via a contrastive safety direction followed by remasking and adaptive steering reduces jailbreak success rates in diffusion language models to 0.64% while preserving output quality.