pith. machine review for the scientific record.

arxiv: 2310.04451 · v2 · submitted 2023-10-03 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links


AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-12 16:22 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords jailbreak attacks · large language models · genetic algorithms · adversarial prompts · LLM alignment · stealthy attacks · prompt generation · transferability

The pith

A hierarchical genetic algorithm can automatically create semantically meaningful jailbreak prompts that attack aligned large language models more effectively than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that jailbreak attacks on aligned LLMs need not rely on manual prompt crafting or token-level changes that produce gibberish. AutoDAN instead evolves prompts through a hierarchical genetic algorithm that scores candidates on both attack success and semantic similarity to the original request. This automation yields prompts that transfer across different models and apply to many different harmful queries. The same prompts also bypass simple perplexity filters that catch less coherent attacks. If correct, this shows that alignment procedures leave exploitable gaps that can be searched systematically rather than found by hand.

Core claim

AutoDAN automatically generates stealthy jailbreak prompts via a carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also achieves superior attack strength, cross-model transferability, and cross-sample universality compared with the baseline. Moreover, AutoDAN can bypass perplexity-based defense methods effectively.

What carries the argument

The hierarchical genetic algorithm, which maintains populations of prompt variants, applies selection, crossover, and mutation operators guided by a fitness score that rewards both successful jailbreak responses from the target LLM and high semantic similarity to the original user query.
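A minimal sketch of how such a dual-objective evolutionary search could be organized. This is not the paper's implementation: `query_target_llm`, `jailbreak_score`, `semantic_similarity`, `crossover`, and `mutate` are toy stand-ins, and every weight and population setting is illustrative rather than taken from AutoDAN.

```python
import random

# Toy stand-ins so the sketch runs end to end; a real attack would call the
# target model and an embedding model here. None of this is the paper's code.

def query_target_llm(text):
    # stand-in target LLM: refuses unless the word "hypothetically" appears in the input
    return "Sure, here is ..." if "hypothetically" in text.lower() else "I'm sorry, I can't help with that."

def jailbreak_score(response):
    # 1.0 if no refusal string appears in the response, else 0.0
    refusals = ("i'm sorry", "i cannot", "i can't")
    return 0.0 if any(r in response.lower() for r in refusals) else 1.0

def semantic_similarity(a, b):
    # token-overlap Jaccard as a cheap proxy for embedding cosine similarity
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def crossover(a, b):
    # splice the first half of one prompt's sentences with the second half of the other's
    sa, sb = a.split(". "), b.split(". ")
    return ". ".join(sa[: max(1, len(sa) // 2)] + sb[len(sb) // 2:])

def mutate(prompt):
    # shuffle sentence order; the paper's operators are richer (synonym swap, LLM rewriting)
    parts = prompt.split(". ")
    random.shuffle(parts)
    return ". ".join(parts)

def fitness(prompt, harmful_query, w_attack=1.0, w_sem=0.5):
    # combined score: jailbreak success on the target plus closeness to the original request
    attack = jailbreak_score(query_target_llm(f"{prompt}\n{harmful_query}"))
    return w_attack * attack + w_sem * semantic_similarity(prompt, harmful_query)

def evolve(seed_prompts, harmful_query, generations=20, pop_size=16,
           elite_frac=0.25, mutation_rate=0.3):
    # evolve a population of prompt variants under the combined fitness
    # (assumes at least two seed prompts)
    population = list(seed_prompts)
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: fitness(p, harmful_query), reverse=True)
        elites = ranked[: max(2, int(elite_frac * pop_size))]
        children = []
        while len(elites) + len(children) < pop_size:
            a, b = random.sample(elites, 2)        # parent selection from the fittest candidates
            child = crossover(a, b)
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = elites + children
    return max(population, key=lambda p: fitness(p, harmful_query))
```

The hierarchical aspect of the paper's algorithm, with separate paragraph-level and sentence- or word-level populations, is not reproduced here; the sketch only shows the single-level skeleton of the search.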

If this is right

  • Attackers can scale jailbreak generation without repeated manual design.
  • Cross-model transferability means a single search can produce prompts effective on multiple aligned LLMs.
  • Cross-sample universality implies prompts can be reused across different harmful requests.
  • Perplexity-based defenses alone are insufficient against semantically coherent adversarial prompts.
  • Alignment training must account for the possibility of automated evolutionary search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future defenses could incorporate semantic similarity checks or evolutionary-robust training objectives.
  • The same search process might be applied to discover other classes of prompt vulnerabilities beyond jailbreaks.
  • Models trained with knowledge of such genetic attacks during alignment may develop broader resistance.

Load-bearing premise

The fitness function based on attack success and semantic preservation will produce prompts that continue to work on LLMs and queries outside the specific ones used during the search.

What would settle it

Generate AutoDAN prompts on one set of models and queries, then test those exact prompts on an unseen model and against a perplexity filter: if they stop eliciting jailbreaks or get rejected as high-perplexity, the transferability and stealth claims do not hold.
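A minimal sketch of that experiment, assuming a set of already-evolved prompts is in hand. The held-out model is a caller-supplied callable, the refusal check is a placeholder string match, GPT-2 serves as the perplexity scorer, and the threshold of 200 is an arbitrary illustrative value, not one taken from the paper or the defenses it evaluates.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    # exp of the mean token-level cross-entropy under GPT-2
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss
    return math.exp(loss.item())

def is_refusal(response):
    # placeholder refusal check: look for common refusal strings
    return any(s in response.lower() for s in ("i'm sorry", "i cannot", "i can't"))

def settle(evolved_prompts, held_out_queries, query_held_out_model, threshold=200.0):
    # count prompts that pass the perplexity filter AND still jailbreak a model
    # that played no role in the search
    survivors = 0
    for prompt in evolved_prompts:
        if perplexity(prompt) > threshold:     # rejected by the perplexity defense
            continue
        for query in held_out_queries:
            response = query_held_out_model(f"{prompt}\n{query}")
            if not is_refusal(response):
                survivors += 1
                break
    return survivors
```

A low survivor count on genuinely disjoint models and queries would undercut the paper's transfer and stealth claims; a high one would corroborate them.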

read the original abstract

The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AutoDAN, a novel jailbreak attack against aligned LLMs that uses a hierarchical genetic algorithm to automatically generate stealthy jailbreak prompts while preserving semantic meaningfulness. It claims that this approach automates prompt generation, achieves superior attack success rates, cross-model transferability, and cross-sample universality compared to baselines, and effectively bypasses perplexity-based defenses.

Significance. If the results hold, this work is significant for LLM safety research because it provides a scalable, automated method for discovering semantically coherent jailbreaks, addressing limitations of both manual crafting and token-level attacks. The hierarchical GA design combined with semantic preservation offers a practical advance over prior techniques, and the extensive evaluations across multiple models and samples strengthen the empirical foundation for claims about transfer and universality.

major comments (2)
  1. [§4.2] Cross-model transferability experiments: The superiority claims rely on prompts evolved using fitness evaluations on specific LLMs and samples; the paper does not explicitly confirm that the search-time models are fully disjoint from the target models in transfer tests. Overlap or correlation could allow the GA to exploit idiosyncratic behaviors, weakening the generalization argument.
  2. [§4.3] Cross-sample universality results: Fitness is computed directly on response success for the optimization samples; while universality is reported, the manuscript lacks detail on whether test samples are drawn from a distribution independent of the search samples. This leaves open the risk that reported gains reflect overfitting rather than broad applicability.
minor comments (2)
  1. [Abstract] Adding concrete numbers (e.g., number of models and samples tested) would make the performance claims easier to assess at a glance.
  2. [§3] Method: The integration of the semantic preservation term into the GA selection and mutation operators could be described with a short pseudocode snippet for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us strengthen the clarity of our experimental methodology. We address each major comment point by point below, providing explicit clarifications on the disjointness of models and samples. We have revised the manuscript accordingly to make these aspects unambiguous.

read point-by-point responses
  1. Referee: [§4.2] Cross-model transferability experiments: The superiority claims rely on prompts evolved using fitness evaluations on specific LLMs and samples; the paper does not explicitly confirm that the search-time models are fully disjoint from the target models in transfer tests. Overlap or correlation could allow the GA to exploit idiosyncratic behaviors, weakening the generalization argument.

    Authors: We thank the referee for this important clarification request. In the cross-model transferability experiments of §4.2, prompt evolution via the hierarchical genetic algorithm is performed exclusively using fitness evaluations on open-source aligned models (Vicuna-7B and Llama-2-7B-chat). The resulting prompts are then transferred and evaluated on entirely disjoint target models, including GPT-3.5-Turbo, GPT-4, and Claude-2, none of which participate in any search-time fitness computation or share parameters with the evolution models. This separation ensures that reported transfer gains reflect generalization rather than exploitation of model-specific idiosyncrasies. We have revised §4.2 to explicitly document the disjoint model partitions used for evolution versus transfer testing. revision: yes

  2. Referee: [§4.3] Cross-sample universality results: Fitness is computed directly on response success for the optimization samples; while universality is reported, the manuscript lacks detail on whether test samples are drawn from a distribution independent of the search samples. This leaves open the risk that reported gains reflect overfitting rather than broad applicability.

    Authors: We appreciate the referee's concern regarding potential overfitting in the universality evaluation. In §4.3, the hierarchical GA optimizes prompts using fitness computed on a fixed optimization set of 10 harmful queries. Universality is then assessed on a held-out test set of 50 distinct queries sampled from the same AdvBench distribution but never used during optimization or fitness evaluation. This ensures the test samples are independent of the search process. We have updated §4.3 with a detailed description of this sample partitioning and independence to address overfitting concerns and better support the broad applicability claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical GA optimization with external oracles

full rationale

The paper describes an algorithmic method (hierarchical genetic algorithm) whose fitness is computed directly from responses of external LLMs, followed by separate empirical evaluation of transfer and universality on held-out models and samples. No equations, predictions, or uniqueness claims reduce by construction to parameters fitted on the same data; the central results are falsifiable performance numbers rather than self-referential derivations. Self-citations, if present, are not load-bearing for the core claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard evolutionary computation assumptions applied to prompt search; no new physical entities or ad-hoc constants are introduced beyond typical GA hyperparameters.

free parameters (1)
  • GA hyperparameters (population size, mutation rate, generations)
    Chosen to control the search process in the hierarchical genetic algorithm; values are not derived from first principles but tuned for performance (an illustrative configuration is sketched after this ledger).
axioms (1)
  • domain assumption: an LLM response can be reliably scored as a successful jailbreak based on its output content
    Invoked when defining the fitness function for prompt evolution.
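For concreteness, the kind of free parameters the ledger entry above refers to could be collected in a small config. Every value below is illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class GASearchConfig:
    # Illustrative knobs for an evolutionary prompt search; none derived from first principles.
    population_size: int = 64       # number of prompt candidates per generation
    generations: int = 100          # search budget
    elite_fraction: float = 0.1     # fraction of top candidates carried over unchanged
    crossover_rate: float = 0.5     # probability of sentence-level recombination
    mutation_rate: float = 0.01     # probability of word-level mutation
    semantic_weight: float = 0.5    # weight of the semantic-similarity term in the fitness
```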

pith-pipeline@v0.9.0 · 5547 in / 1261 out tokens · 39146 ms · 2026-05-12T16:22:50.975990+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  2. FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems

    cs.CR 2026-05 unverdicted novelty 7.0

    FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.

  3. Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success

    cs.CR 2026-05 accept novelty 7.0

    Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.

  4. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 7.0

    Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

  5. Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

    cs.CR 2026-05 conditional novelty 7.0

    Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.

  6. ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

    cs.CL 2026-05 unverdicted novelty 7.0

    ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

  7. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...

  8. RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...

  9. Jailbroken Frontier Models Retain Their Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

  10. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  11. Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

    cs.CR 2026-04 unverdicted novelty 7.0

    A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

  12. STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

    cs.CL 2026-04 unverdicted novelty 7.0

    STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cos...

  13. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

    cs.CR 2026-04 unverdicted novelty 7.0

    HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

  14. SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

    cs.CR 2026-04 unverdicted novelty 7.0

    SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.

  15. Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

    cs.AI 2026-04 unverdicted novelty 7.0

    PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.

  16. SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

    cs.CR 2026-04 unverdicted novelty 7.0

    SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...

  17. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  18. Leveraging RAG for Training-Free Alignment of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...

  19. MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

    cs.CR 2026-05 unverdicted novelty 6.0

    MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

  20. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 6.0

    PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...

  21. ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

    cs.CR 2026-05 unverdicted novelty 6.0

    ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.

  22. Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

    cs.AI 2026-05 unverdicted novelty 6.0

    PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

  23. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...

  24. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.

  25. Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

    cs.CR 2026-05 unverdicted novelty 6.0

    Refusal in LLMs leaves a detectable upstream trajectory that SALO exploits to raise jailbreak detection from near zero to over 90 percent even under forced-decoding attacks.

  26. A Theoretical Game of Attacks via Compositional Skills

    cs.CL 2026-05 unverdicted novelty 6.0

    A theoretical attacker-defender game in LLM adversarial prompting yields a best-response attack related to existing methods, reveals attacker advantages at equilibrium, and derives a provably optimal defense with stro...

  27. FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

    cs.CR 2026-04 unverdicted novelty 6.0

    FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.

  28. The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

    cs.CR 2026-04 unverdicted novelty 6.0

    Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

  29. Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking

    cs.CV 2026-04 unverdicted novelty 6.0

    Attention-Guided Visual Jailbreaking blinds LVLMs to safety instructions by suppressing attention to alignment prefixes and anchoring generation on adversarial image features, reaching 94.4% attack success rate on Qwen-VL.

  30. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  31. TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

    cs.CR 2026-04 unverdicted novelty 6.0

    TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.

  32. CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

    cs.CR 2026-04 unverdicted novelty 6.0

    CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.

  33. SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 6.0

    SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.

  34. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    cs.CR 2024-03 accept novelty 6.0

    JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...

  35. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    cs.LG 2023-10 accept novelty 6.0

    SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

  36. A Systematic Study of Training-Free Methods for Trustworthy Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Training-free methods for LLM trustworthiness show inconsistent results across dimensions, with clear trade-offs in utility, robustness, and overhead depending on where they intervene during inference.

  37. Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.

  38. Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    cs.CR 2024-07 accept novelty 4.0

    A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

  39. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
