NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2026 2roles
background 1polarities
background 1representative citing papers
OGPSA projects safety gradients orthogonal to a low-rank subspace from general capability gradients, improving safety-utility trade-offs in SFT and DPO pipelines on Qwen2.5-7B and Llama3.1-8B.
citing papers explorer
-
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
-
Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
OGPSA projects safety gradients orthogonal to a low-rank subspace from general capability gradients, improving safety-utility trade-offs in SFT and DPO pipelines on Qwen2.5-7B and Llama3.1-8B.