Exploiting Novel GPT-4 APIs
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1
polarities
unclear 1
representative citing papers
OBBR projects poisoned samples into benign space via rewriting with open-book examples, raising safety performance by 51% on average versus prior defenses across five attacks and four LLMs.
Weight orthogonalization unalignment enables LLMs to assist malicious activities more effectively than jailbreak-tuning, with less hallucination and better retained performance, while supervised fine-tuning mitigates the added attack capabilities.
AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.
citing papers explorer
- Refusal in Language Models Is Mediated by a Single Direction
  Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
- Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
  OBBR projects poisoned samples into benign space via rewriting with open-book examples, raising safety performance by 51% on average versus prior defenses across five attacks and four LLMs.
- Understanding the Effects of Safety Unalignment on Large Language Models
  Weight orthogonalization unalignment enables LLMs to assist malicious activities more effectively than jailbreak-tuning, with less hallucination and better retained performance, while supervised fine-tuning mitigates the added attack capabilities.
- AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
  AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.