Response-time linear probing on first generated tokens detects prefilling attacks missed by prompt-time activation defenses, achieving 0/40 attack success and 0% false positives across seven models while composing orthogonally with AlphaSteer.
hub Canonical reference
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Canonical reference. 71% of citing Pith papers cite this work as background.
abstract
The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of
co-cited works
representative citing papers
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.
Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.
Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
TAGO performs sparse jailbreak optimization on audio LMs by retaining only high-gradient-energy tokens, preserving near-full ASR at 25% retention across three models.
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
Causal tracing reveals a persistent Refusal Trajectory in LLM hidden states; SALO detector using sparse activations from a layer window improves jailbreak detection across Qwen, Llama, and Mistral models.
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Crescendo is a multi-turn escalation jailbreak that achieves high success rates on GPT-4, Gemini, Llama, and Claude by building on the model's prior responses, with an automated tool outperforming prior attacks on AdvBench.
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.
A narrative survey that catalogs fifty papers on diffusion-based adversarial techniques across text, vision, and vision-language models, proposes a six-class taxonomy of diffusion roles plus a unified five-dimension evaluation framework, and releases a companion catalog.
TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as most vulnerable stages across cognitive trust and emotional dependency.
Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.
PsychoPass shows adversarial LLM conversations exhibit an early geometric fingerprint in representation space that persists after removing length confounds and is detectable from short prefixes.
Applies MAP-Elites quality-diversity optimization to evolve semantic attack strategies across dimensions like strategy type, encoding, and length, uncovering distinct vulnerability profiles in four LLMs including GPT-4o-mini and Claude 3.5 Sonnet.
TRACE is a task-aware adaptive self-evolving jailbreaking framework that achieves up to 100% bypass rates on LLM agents via subtask decomposition and scenario evolution.
citing papers explorer
-
Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense
Response-time linear probing on first generated tokens detects prefilling attacks missed by prompt-time activation defenses, achieving 0/40 attack success and 0% false positives across seven models while composing orthogonally with AlphaSteer.
-
FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.
-
Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success
Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.
-
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
TAGO performs sparse jailbreak optimization on audio LMs by retaining only high-gradient-energy tokens, preserving near-full ASR at 25% retention across three models.
-
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Causal tracing reveals a persistent Refusal Trajectory in LLM hidden states; SALO detector using sparse activations from a layer window improves jailbreak detection across Qwen, Llama, and Mistral models.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
Crescendo is a multi-turn escalation jailbreak that achieves high success rates on GPT-4, Gemini, Llama, and Claude by building on the model's prior responses, with an automated tool outperforming prior attacks on AdvBench.
-
Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models
A narrative survey that catalogs fifty papers on diffusion-based adversarial techniques across text, vision, and vision-language models, proposes a six-class taxonomy of diffusion roles plus a unified five-dimension evaluation framework, and releases a companion catalog.
-
The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs
Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.
-
PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations
PsychoPass shows adversarial LLM conversations exhibit an early geometric fingerprint in representation space that persists after removing length confounds and is detectable from short prefixes.
-
Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
Applies MAP-Elites quality-diversity optimization to evolve semantic attack strategies across dimensions like strategy type, encoding, and length, uncovering distinct vulnerability profiles in four LLMs including GPT-4o-mini and Claude 3.5 Sonnet.
-
TRACE: Task-Aware Adaptive Self-Evolving Agentic Jailbreaking
TRACE is a task-aware adaptive self-evolving jailbreaking framework that achieves up to 100% bypass rates on LLM agents via subtask decomposition and scenario evolution.
-
Automatically Attacking Software Reverse Engineering AI Agents
Genetic algorithm prompt generation enables prompt injection into binaries via string assignments to fool LLM-powered decompilers and disassemblers.
-
RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry
RouteScan identifies malicious prompts in MoE LLMs using GPU expert routing telemetry as a privacy-preserving fingerprint, achieving AUROC above 0.93 on unseen harmful domains.
-
Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling
Ellipsoid Control is a white-list test-time jailbreak defense that fits an anisotropic ellipsoid from benign activations to constrain projected gradient descent updates, aiming to improve the safety-utility tradeoff over black-list RepE methods.
-
Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation
A bi-level adversarial training framework simulates diverse jailbroken activations through unsupervised extrapolation and trains a potential-induced steering field to achieve zero-shot defense with attack success rates mostly below 5% across tested LLMs and jailbreak families.
-
Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling
Babel is an efficient black-box jailbreaking framework that formalizes sparse safety attention heads via a mathematical obfuscation model and uses iterative distribution refinement to achieve higher attack success rates on models like GPT-4o and Claude-3-5-haiku with around 40 queries.
-
DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models
DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.
-
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
-
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
-
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
-
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
-
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
-
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
-
Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.
-
Benchmarking Misuse Mitigation Against Covert Adversaries
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
-
Peering Behind the Shield: Guardrail Identification in Large Language Models
AP-Test identifies deployed guardrails in LLMs via adversarial prompt testing and a match score metric, reporting perfect accuracy on four open-source guardrails.
-
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.
-
Adversarial Reframing: A Framework for Targeted Generation in Language Models
THREAT uses coordinated LLMs in an iterative optimization loop to generate jailbreak prompts that achieve higher success rates and lower detection rates than previous methods across tested models and datasets.
-
New Wide-Net-Casting Jailbreak Attacks Risk Large Models
The paper demonstrates that a tailored jailbreak method for querying groups of large models can achieve up to 100% success rate in some experiments on unprotected models, revealing overlooked multi-model safety risks.
-
SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models
SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.
-
Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG
Introduces NPAS and AV Filter using LLM attention weights to defend RAG against poisoning, reporting up to 20% accuracy gains while adaptive attacks reach 35% success.
-
Investigating Detection and Obfuscation of Prompt Injection Attacks Against Software Reverse Engineering AI Agents
This work examines prompt injection vulnerabilities in agentic software reverse engineering AI systems and tests detection, obfuscation, and defense techniques.
-
Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks
Proposes the TRIAD framework that treats multi-turn multimodal attacks as continuous trajectories and uses structural anomaly detection, regularized Mahalanobis distance, topological acceleration, and a time-varying Cox model with Bayesian HMM feedback to predict and bound expected time-to-failure.
-
LLM-Safety Evaluations Lack Robustness
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
-
From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI
The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.
-
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
- SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
- SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits