Weight orthogonalization unalignment enables LLMs to assist malicious activities more effectively than jailbreak-tuning, with less hallucination and better retained performance, while supervised fine-tuning mitigates the added attack capabilities.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Understanding the Effects of Safety Unalignment on Large Language Models