ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
3 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: method 1
Citation-polarity summary: background 1
Representative citing papers: "Jailbreak Attacks and Defenses Against Large Language Models: A Survey" and "Robust AI Security and Alignment: A Sisyphean Endeavor?"
citing papers explorer
- ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs
  ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey
  A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
- Robust AI Security and Alignment: A Sisyphean Endeavor?
  AI security and alignment cannot achieve full robustness because any sufficiently powerful AI inherits incompleteness-style limitations from formal systems.