OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants
1 Pith paper cites this work. Polarity classification is still indexing.
Citation-role summary: dataset 1
Citation-polarity summary: still indexing
Fields: cs.CR 1
Years: 2026 1
Verdicts: UNVERDICTED 1
Roles: dataset 1
Polarities: use dataset 1
Representative citing papers
Understanding the Effects of Safety Unalignment on Large Language Models
Weight orthogonalization unalignment enables LLMs to assist malicious activities more effectively than jailbreak-tuning, with less hallucination and better retained performance, while supervised fine-tuning mitigates the added attack capabilities.