pith. machine review for the scientific record. sign in

arxiv: 2405.13068 · v4 · submitted 2024-05-20 · 💻 cs.CR · cs.AI· cs.LG

Recognition: unknown

Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment

Authors on Pith no claims yet
classification 💻 cs.CR cs.AIcs.LG
keywords alignmentsafetyvulnerabilitiesharmfulllmslogitrobustssag
0
0 comments X
read the original abstract

Large language models (LLMs) have revolutionized various applications, making robust safety alignment essential to prevent harmful outputs. Current safety alignment techniques, however, harbor inherent vulnerabilities due to their reliance on logit suppression. In this work, we identify critical logit-level vulnerabilities by introducing Semantic-sensitive Alignment and Generation (SSAG), a method designed to systematically manipulate output-layer logits without altering model parameters. Experiments on five popular LLMs show that SSAG exposes harmful responses with a 95% success rate while reducing response time by 86%. VulMine also demonstrates superior attack efficacy, achieving an average ASR of up to 77% against strong defensive mechanisms. These findings reveal crucial weaknesses in existing alignment methods, highlighting an urgent need for improved vulnerability detection and robust safety alignment strategies. Our code is available on github.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    cs.CR 2024-03 accept novelty 6.0

    JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...