Persona Attack uses step-by-step memory injections to achieve up to 95% success in making LLMs ignore safety alignments, with effectiveness depending on model memory and instruction combinations.
Exploiting the index gradients for optimization-based jailbreaking on large language models.arXiv preprint arXiv:2412.08615, 2024
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CR 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models
Persona Attack uses step-by-step memory injections to achieve up to 95% success in making LLMs ignore safety alignments, with effectiveness depending on model memory and instruction combinations.