Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment
Large language models (LLMs) have revolutionized various applications, making robust safety alignment essential to prevent harmful outputs. Current safety alignment techniques, however, harbor inherent vulnerabilities due to their reliance on logit suppression. In this work, we identify critical logit-level vulnerabilities by introducing Semantic-sensitive Alignment and Generation (SSAG), a method designed to systematically manipulate output-layer logits without altering model parameters. Experiments on five popular LLMs show that SSAG exposes harmful responses with a 95% success rate while reducing response time by 86%. VulMine also demonstrates superior attack efficacy, achieving an average ASR of up to 77% against strong defensive mechanisms. These findings reveal crucial weaknesses in existing alignment methods, highlighting an urgent need for improved vulnerability detection and robust safety alignment strategies. Our code is available on GitHub.
Forward citations
Cited by 1 Pith paper
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...