Recognition: 2 Lean theorem links
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (published as a conference paper at ICLR 2024)
Pith reviewed 2026-05-12 16:22 UTC · model grok-4.3
The pith
A hierarchical genetic algorithm can automatically create semantically meaningful jailbreak prompts that attack aligned large language models more effectively than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutoDAN automatically generates stealthy jailbreak prompts via a carefully designed hierarchical genetic algorithm. Extensive evaluations show that AutoDAN not only automates the process while preserving semantic meaningfulness, but also achieves superior attack strength, cross-model transferability, and cross-sample universality compared with the baseline. Moreover, AutoDAN can effectively bypass perplexity-based defense methods.
What carries the argument
The hierarchical genetic algorithm, which maintains populations of prompt variants, applies selection, crossover, and mutation operators guided by a fitness score that rewards both successful jailbreak responses from the target LLM and high semantic similarity to the original user query.
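The loop above can be sketched in miniature. Everything model-specific here is a stand-in: `jailbreak_score` replaces querying the target LLM for a harmful completion, and Jaccard word overlap replaces the paper's semantic-preservation term; AutoDAN's actual operators are LLM-assisted and hierarchical (sentence- and word-level). A toy sketch under those assumptions:

```python
import random

def semantic_similarity(prompt: str, query: str) -> float:
    """Jaccard word overlap; stand-in for a real semantic-similarity score."""
    a, b = set(prompt.split()), set(query.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def jailbreak_score(prompt: str) -> float:
    """Stand-in for the target-LLM oracle: rewards a 'roleplay' framing."""
    return 1.0 if "roleplay" in prompt.split() else 0.0

def fitness(prompt: str, query: str, w_sim: float = 0.5) -> float:
    # Rewards both attack success and closeness to the original query.
    return jailbreak_score(prompt) + w_sim * semantic_similarity(prompt, query)

def crossover(p1: str, p2: str, rng: random.Random) -> str:
    # Single-point word splice; stand-in for sentence-level recombination.
    w1, w2 = p1.split(), p2.split()
    cut = rng.randrange(1, min(len(w1), len(w2)))
    return " ".join(w1[:cut] + w2[cut:])

def mutate(prompt: str, vocab: list, rng: random.Random) -> str:
    # Word-level mutation: swap one word for a vocabulary item.
    words = prompt.split()
    words[rng.randrange(len(words))] = rng.choice(vocab)
    return " ".join(words)

def evolve(query: str, vocab: list, pop_size: int = 20,
           generations: int = 30, seed: int = 0) -> str:
    rng = random.Random(seed)
    pop = [" ".join(rng.choices(vocab, k=6)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: fitness(p, query), reverse=True)
        elite = pop[: pop_size // 2]                 # selection (elitism)
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            child = crossover(p1, p2, rng)
            if rng.random() < 0.3:
                child = mutate(child, vocab, rng)
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda p: fitness(p, query))

query = "please describe the procedure"
vocab = query.split() + ["roleplay", "kindly", "scenario", "explain"]
best = evolve(query, vocab)
```

Because the elite half survives each generation unchanged, the best fitness is non-decreasing across generations, which mirrors why such a search reliably converges on phrasings that both trigger the oracle and stay close to the query.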
If this is right
- Attackers can scale jailbreak generation without repeated manual design.
- Cross-model transferability means a single search can produce prompts effective on multiple aligned LLMs.
- Cross-sample universality implies prompts can be reused across different harmful requests.
- Perplexity-based defenses alone are insufficient against semantically coherent adversarial prompts.
- Alignment training must account for the possibility of automated evolutionary search.
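The perplexity point in the list above is mechanical enough to demonstrate. Real defenses score prompts with an LLM's token log-probabilities; in this sketch a smoothed unigram model over a tiny corpus stands in for that scorer, and the corpus, threshold, and example strings are all illustrative assumptions:

```python
import math
from collections import Counter

# Tiny stand-in "language model": add-alpha smoothed unigram counts.
CORPUS = ("you are a helpful assistant please answer the question "
          "describe the steps to write a story about a character").split()
COUNTS = Counter(CORPUS)
TOTAL = sum(COUNTS.values())

def unigram_logprob(word: str, alpha: float = 1.0) -> float:
    vocab = len(COUNTS) + 1  # +1 bucket for unseen words
    return math.log((COUNTS[word] + alpha) / (TOTAL + alpha * vocab))

def perplexity(text: str) -> float:
    words = text.split()
    return math.exp(-sum(unigram_logprob(w) for w in words) / len(words))

def flagged(text: str, threshold: float = 30.0) -> bool:
    """A basic perplexity filter: reject prompts the model finds unlikely."""
    return perplexity(text) > threshold

coherent = "please answer the question about a character"   # fluent, AutoDAN-style
gibberish = "zx qv describ ing !! tok enz pleely"           # token-salad suffix
```

Under this toy model the fluent prompt scores well below the threshold while the token-salad string is rejected, which is the structural reason semantically coherent search output slips past perplexity-only defenses.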
Where Pith is reading between the lines
- Future defenses could incorporate semantic similarity checks or evolutionary-robust training objectives.
- The same search process might be applied to discover other classes of prompt vulnerabilities beyond jailbreaks.
- Models trained with knowledge of such genetic attacks during alignment may develop broader resistance.
Load-bearing premise
The fitness function based on attack success and semantic preservation will produce prompts that continue to work on LLMs and queries outside the specific ones used during the search.
What would settle it
Generate AutoDAN prompts on one set of models and queries, then test whether those exact prompts fail to elicit jailbreaks on a new model or get rejected by a perplexity filter.
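A falsification harness for that test is small. The paper's own evaluation matches refusal strings in the style of GCG; the signal list and toy model below are illustrative stand-ins, and `query_model` would wrap a held-out target model's API in practice:

```python
# Illustrative refusal signals; the paper's actual list is longer.
REFUSAL_SIGNALS = ("i'm sorry", "i cannot", "i apologize",
                   "as an ai", "i am not able")

def is_jailbroken(response: str) -> bool:
    """Success = the response contains no refusal signal."""
    low = response.lower()
    return not any(sig in low for sig in REFUSAL_SIGNALS)

def transfer_success_rate(prompts, query_model) -> float:
    """query_model: callable prompt -> response from a held-out target model."""
    hits = sum(is_jailbroken(query_model(p)) for p in prompts)
    return hits / len(prompts)

# Toy stand-in for a held-out aligned model.
def toy_model(prompt: str) -> str:
    if "harmful" in prompt:
        return "I'm sorry, I can't help with that."
    return "Sure, here is an overview..."

rate = transfer_success_rate(["harmful request", "benign framing"], toy_model)
```

Running the evolved prompts through such a harness on models and queries disjoint from the search, alongside a perplexity filter, would directly probe the transferability and stealthiness claims.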
original abstract
The aligned Large Language Models (LLMs) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. However, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned LLMs. Investigating jailbreak prompts can lead us to delve into the limitations of LLMs and further guide us to secure them. Unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. In light of these challenges, we intend to answer this question: Can we develop an approach that can automatically generate stealthy jailbreak prompts? In this paper, we introduce AutoDAN, a novel jailbreak attack against aligned LLMs. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. Extensive evaluations demonstrate that AutoDAN not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. Moreover, we also compare AutoDAN with perplexity-based defense methods and show that AutoDAN can bypass them effectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AutoDAN, a novel jailbreak attack against aligned LLMs that uses a hierarchical genetic algorithm to automatically generate stealthy jailbreak prompts while preserving semantic meaningfulness. It claims that this approach automates prompt generation, achieves superior attack success rates, cross-model transferability, and cross-sample universality compared to baselines, and effectively bypasses perplexity-based defenses.
Significance. If the results hold, this work is significant for LLM safety research because it provides a scalable, automated method for discovering semantically coherent jailbreaks, addressing limitations of both manual crafting and token-level attacks. The hierarchical GA design combined with semantic preservation offers a practical advance over prior techniques, and the extensive evaluations across multiple models and samples strengthen the empirical foundation for claims about transfer and universality.
major comments (2)
- [§4.2] Cross-model transferability experiments: The superiority claims rely on prompts evolved using fitness evaluations on specific LLMs and samples; the paper does not explicitly confirm that the search-time models are fully disjoint from the target models in transfer tests. Overlap or correlation could allow the GA to exploit idiosyncratic behaviors, weakening the generalization argument.
- [§4.3] Cross-sample universality results: Fitness is computed directly on response success for the optimization samples; while universality is reported, the manuscript lacks detail on whether test samples are drawn from a distribution independent of the search samples. This leaves open the risk that reported gains reflect overfitting rather than broad applicability.
minor comments (2)
- [Abstract] Adding concrete numbers (e.g., number of models and samples tested) would make the performance claims easier to assess at a glance.
- [§3] Method: The integration of the semantic preservation term into the GA selection and mutation operators could be described with a short pseudocode snippet for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us strengthen the clarity of our experimental methodology. We address each major comment point by point below, providing explicit clarifications on the disjointness of models and samples. We have revised the manuscript accordingly to make these aspects unambiguous.
point-by-point responses
- Referee [§4.2] (Cross-model transferability experiments): The superiority claims rely on prompts evolved using fitness evaluations on specific LLMs and samples; the paper does not explicitly confirm that the search-time models are fully disjoint from the target models in transfer tests. Overlap or correlation could allow the GA to exploit idiosyncratic behaviors, weakening the generalization argument.
  Authors: We thank the referee for this important clarification request. In the cross-model transferability experiments of §4.2, prompt evolution via the hierarchical genetic algorithm is performed exclusively using fitness evaluations on open-source aligned models (Vicuna-7B and Llama-2-7B-chat). The resulting prompts are then transferred and evaluated on entirely disjoint target models, including GPT-3.5-Turbo, GPT-4, and Claude-2, none of which participate in any search-time fitness computation or share parameters with the evolution models. This separation ensures that reported transfer gains reflect generalization rather than exploitation of model-specific idiosyncrasies. We have revised §4.2 to explicitly document the disjoint model partitions used for evolution versus transfer testing. (Revision: yes.)
- Referee [§4.3] (Cross-sample universality results): Fitness is computed directly on response success for the optimization samples; while universality is reported, the manuscript lacks detail on whether test samples are drawn from a distribution independent of the search samples. This leaves open the risk that reported gains reflect overfitting rather than broad applicability.
  Authors: We appreciate the referee's concern regarding potential overfitting in the universality evaluation. In §4.3, the hierarchical GA optimizes prompts using fitness computed on a fixed optimization set of 10 harmful queries. Universality is then assessed on a held-out test set of 50 distinct queries sampled from the same AdvBench distribution but never used during optimization or fitness evaluation. This ensures the test samples are independent of the search process. We have updated §4.3 with a detailed description of this sample partitioning and independence to address overfitting concerns and better support the broad applicability claims. (Revision: yes.)
Circularity Check
No circularity: empirical GA optimization with external oracles
full rationale
The paper describes an algorithmic method (hierarchical genetic algorithm) whose fitness is computed directly from responses of external LLMs, followed by separate empirical evaluation of transfer and universality on held-out models and samples. No equations, predictions, or uniqueness claims reduce by construction to parameters fitted on the same data; the central results are falsifiable performance numbers rather than self-referential derivations. Self-citations, if present, are not load-bearing for the core claims.
Axiom & Free-Parameter Ledger
free parameters (1)
- GA hyperparameters (population size, mutation rate, generations)
axioms (1)
- Domain assumption: an LLM response can be reliably scored as a successful jailbreak based on its output content.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability / bilinear_family_forced (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm... superior attack strength in cross-model transferability, and cross-sample universality"
- IndisputableMonolith.Cost.FunctionalEquation / washburn_uniqueness_aczel (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "we introduce AutoDAN, a novel jailbreak attack against aligned LLMs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 39 Pith papers
- OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
  OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...
- FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
  FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.
- Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success
  Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.
- PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
  Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
- Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
  Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
- ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
  ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
- Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
  A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
- RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
  RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
- Jailbroken Frontier Models Retain Their Capabilities
  Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
  A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
- STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
  STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cos...
- Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
  HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
- SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
  SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.
- Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
  PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
- SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
  SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...
- Refusal in Language Models Is Mediated by a Single Direction
  Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
- Leveraging RAG for Training-Free Alignment of LLMs
  RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...
- MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
  MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
- PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
  PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...
- ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
  ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
- Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
  PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
- Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
  A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...
- Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
  FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
- Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
  Refusal in LLMs leaves a detectable upstream trajectory that SALO exploits to raise jailbreak detection from near zero to over 90 percent even under forced-decoding attacks.
- A Theoretical Game of Attacks via Compositional Skills
  A theoretical attacker-defender game in LLM adversarial prompting yields a best-response attack related to existing methods, reveals attacker advantages at equilibrium, and derives a provably optimal defense with stro...
- FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
  FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
- The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
  Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
- Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking
  Attention-Guided Visual Jailbreaking blinds LVLMs to safety instructions by suppressing attention to alignment prefixes and anchoring generation on adversarial image features, reaching 94.4% attack success rate on Qwen-VL.
- Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
  DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
- TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
  TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
- CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
  CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
- SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models
  SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
  JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
  SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
- A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
  Training-free methods for LLM trustworthiness show inconsistent results across dimensions, with clear trade-offs in utility, robustness, and overhead depending on where they intervene during inference.
- Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
  Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey
  A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
  The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Reference graph
Works this paper leans on
- [1] Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. 2023. Accessed 2023-09-28.
- [2] Dogu Araci. FinBERT: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063.
- [3]
- [4] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
- [5] Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. Understanding dataset difficulty with V-usable information. In International Conference on Machine Learning, pp. 5988-6008. PMLR.
- [6] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
- [7] Josh A Goldstein, Girish Sastry, Micah Musser, Renee DiResta, Matthew Gentzel, and Katerina Sedova. Generative language models and automated influence operations: Emerging threats and potential mitigations. arXiv preprint arXiv:2301.04246.
- [8] Julian Hazell. Large language models can be used to effectively scale spear phishing campaigns. arXiv preprint arXiv:2305.06972. Accessed 2023-09-28.
- [9] Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733.
- [10] John X Morris, Eli Lifland, Jack Lanchantin, Yangfeng Ji, and Yanjun Qi. Reevaluating adversarial examples in natural language. arXiv preprint arXiv:2004.14174.
- [11] OpenAI. Snapshot of gpt-3.5-turbo from March 1st, 2023. Accessed 2023-09-28.
- [12] OpenAI. ChatGPT. https://openai.com/blog/chatgpt, 2023a. Accessed 2023-08-30. OpenAI. GPT-4 technical report, 2023b. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Proce...
- [13] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
- [14] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286.
- [15] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- [16] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022a. Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arj...
- [17] Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
- [18] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher. arXiv preprint arXiv:2308.06463.
- [19] Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and Zhenchang Xing. Red teaming ChatGPT via jailbreaking: Bias, robustness, reliability and toxicity. arXiv preprint arXiv:2301.12867.