Adaptive Instruction Composition uses a neural contextual bandit with RL to adaptively combine crowdsourced texts, generating more effective and diverse LLM jailbreaks than random or prior adaptive methods on Harmbench.
In: Proceedings 2024 Network and Distributed System Security Symposium
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 3
roles: method 1
polarities: use method 1
representative citing papers
Incremental Completion Decomposition (ICD) jailbreaks LLMs via sequences of single-word continuations before full harmful responses, outperforming existing methods on AdvBench, JailbreakBench, and StrongREJECT with supporting mechanistic analysis.
SSAG bypasses logit suppression in five LLMs to produce harmful responses at 95% success rate and 86% lower latency; VulMine reaches 77% attack success against defenses.
citing papers explorer
- Adaptive Instruction Composition for Automated LLM Red-Teaming
- One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
- Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment