
arxiv: 2511.12710 · v2 · pith:LJBYYYBZnew · submitted 2025-11-16 · 💻 cs.CL · cs.CR

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Pith reviewed 2026-05-21 18:53 UTC · model grok-4.3

classification 💻 cs.CL cs.CR
keywords jailbreak attacks · LLM red teaming · evolutionary code synthesis · multi-agent systems · automated attack generation · LLM security · code-level optimization

The pith

Optimizing executable code for jailbreak generation rather than prompt wording produces more effective and diverse attacks on LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that red-teaming methods limited to refining attack prompts miss an opportunity to improve the underlying procedure that creates those prompts. By instead evolving complete executable code that includes logic, branching, and repair steps, the approach can generate attacks that adapt to model feedback during development. A sympathetic reader would care because this shift could produce jailbreaks that transfer better across models and prove harder for static defenses to block, since the attack method itself improves over time rather than just the wording of any single input.

Core claim

The central claim is that moving optimization into code space via a multi-agent system that engineers, evolves, and self-corrects attack algorithms yields higher attack success rates and greater diversity than prompt-space methods. Experiments reported in the paper show the resulting attacks reach an 85.5 percent success rate on Claude-Sonnet-4.5 and average 95.9 percent across tested targets while producing outputs that differ more from one another than those generated by prior techniques.
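To make the two headline numbers concrete, here is a minimal sketch of how a per-target attack success rate (ASR) and an unweighted average across targets could be computed, together with a Wilson score interval to show how much uncertainty a given sample size leaves. The function names, the target labels, and the toy verdicts are hypothetical and chosen for illustration; the paper's own judging and averaging protocol may differ.

```python
from math import sqrt


def attack_success_rate(outcomes):
    """Fraction of attempts judged successful; `outcomes` is a list of bools."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def macro_average_asr(per_target):
    """Unweighted mean of per-target ASRs, so every target counts equally."""
    rates = [attack_success_rate(o) for o in per_target.values()]
    return sum(rates) / len(rates)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Toy judge verdicts, invented for illustration only (not the paper's data).
toy_verdicts = {
    "target_a": [True, True, True, False],
    "target_b": [True, True, True, True],
}

asr_a = attack_success_rate(toy_verdicts["target_a"])
macro = macro_average_asr(toy_verdicts)
low, high = wilson_interval(3, 4)
print(asr_a)   # 0.75
print(macro)   # 0.875
```

With only four attempts the interval around 0.75 is very wide, which is why the number of queries per target matters when reading any single ASR figure.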

What carries the argument

The multi-agent system that autonomously writes, executes, and iteratively rewrites executable code for generating jailbreak prompts in response to target-model feedback and failed attempts.

If this is right

  • Attack generation can optimize execution flow, reusable components, and failure-driven repair in addition to the final prompt text.
  • The resulting attacks transfer across multiple evaluated LLM targets at high success rates.
  • Generated attacks exhibit greater diversity than those produced by prompt-optimization baselines.
  • Releasing the framework supports further work on evolutionary methods that operate directly in executable code space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenders may eventually need to analyze or restrict dynamic code patterns rather than only static prompt patterns.
  • The same evolutionary approach could be tested on other optimization tasks such as generating adversarial examples in non-language domains.
  • Success may hinge on the initial agent capabilities or seed code quality, pointing to a possible direction for improving the multi-agent setup itself.

Load-bearing premise

That the self-correction process will reliably produce code that generalizes to new models and stays effective as defenses change.

What would settle it

If the evolved code-based attacks achieve substantially lower success rates than the reported averages on a large language model released after the experiments, that would indicate the method does not produce reliably generalizable algorithms.

Figures

Figures reproduced from arXiv: 2511.12710 by Jie Li, Juncheng Li, Xingjun Ma, Xin Wang, Yan Teng, Yingchun Wang, Yixu Wang, Yunhao Chen.

Figure 1
Figure 1: An overview of our proposed EvoSynth method. The process begins with the Reconnaissance Agent formulating a strategy. …
Figure 2
Figure 2: Diversity Comparison of Generated Attack Prompts. The raincloud plot shows the distribution of pairwise diversity scores for prompts from the X-Teaming dataset and those generated by EvoSynth. The wider distribution and higher median score for EvoSynth indicate that our framework synthesizes a more semantically diverse and non-redundant set of attacks.
Figure 3
Figure 3: Cumulative Convergence of Attack Success. The plots show the cumulative percentage of sessions that have achieved their highest score by a given point in time. (Left) Convergence by the tool's code evolution iteration number. (Right) Convergence by the total number of agent actions taken in the session. Both plots demonstrate rapid convergence, with the majority of optimal attacks being discovered early in…
Figure 4
Figure 4: Cumulative Distribution of Attack Algorithm Transferability. The plot shows the cumulative percentage of all synthesized attack algorithms (y-axis) that meet or exceed a given usage percentage (x-axis). The curve demonstrates that while many algorithms are specialized, a significant portion are highly transferable, with 20% of all algorithms being effective enough to be used on over 80%.
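The cumulative curves described in the captions for Figures 3 and 4 can be reproduced from simple per-session and per-algorithm records. The sketch below is one plausible reading of those captions, not the authors' plotting code: `cumulative_convergence` and `share_meeting_threshold` are hypothetical helper names, and the toy inputs are invented for illustration.

```python
def cumulative_convergence(best_iterations, max_iteration):
    """Percent of sessions whose best score was reached at or before
    each iteration 1..max_iteration (the Figure 3 style curve)."""
    if not best_iterations:
        raise ValueError("no sessions given")
    n = len(best_iterations)
    return [
        100.0 * sum(1 for b in best_iterations if b <= k) / n
        for k in range(1, max_iteration + 1)
    ]


def share_meeting_threshold(usage_percentages, threshold):
    """Percent of algorithms whose usage percentage meets or exceeds
    `threshold` (one point on the Figure 4 style curve)."""
    if not usage_percentages:
        raise ValueError("no algorithms given")
    n = len(usage_percentages)
    return 100.0 * sum(1 for u in usage_percentages if u >= threshold) / n


# Toy records, invented for illustration only (not the paper's data).
toy_best_iterations = [1, 1, 2, 4]   # iteration at which each session peaked
toy_usage = [10.0, 35.0, 60.0, 85.0, 90.0]  # usage percentage per algorithm

curve = cumulative_convergence(toy_best_iterations, 4)
share = share_meeting_threshold(toy_usage, 80.0)
print(curve)  # [50.0, 75.0, 75.0, 100.0]
print(share)  # 40.0
```

A curve that rises steeply at small iteration numbers corresponds to the "rapid convergence" reading in the Figure 3 caption; evaluating `share_meeting_threshold` over a grid of thresholds yields the full Figure 4 style curve.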
Original abstract

Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet many still formulate attack optimization primarily in the prompt space. In other words, these methods mainly search for better attack wording or better strategy choices, but they do not search over executable code. By moving the search into code space, we can optimize not only the final attack prompt, but also the procedure that generates it, including execution flow, reusable logic, branching, and failure-driven repair. To overcome this gap, we introduce EvoSynth, an autonomous multi-agent framework that shifts the optimization space from prompts to executable code. Instead of refining prompts directly, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite the code-based algorithm in response to target-model feedback and failed attempts. Through extensive experiments, we demonstrate that EvoSynth achieves an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5 and a 95.9% average ASR across evaluated targets, while generating attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research on evolutionary synthesis in executable code space.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces EvoSynth, an autonomous multi-agent framework that evolves executable code for generating jailbreak attacks on LLMs rather than optimizing prompts directly. It incorporates a code-level self-correction loop that rewrites attack algorithms based on target-model feedback. The central empirical claims are an 85.5% attack success rate (ASR) on Claude-Sonnet-4.5 and a 95.9% average ASR across evaluated targets, together with significantly higher attack diversity than prior methods. The framework is released to support further research.

Significance. If the performance and diversity claims are supported by rigorous controls, this work would advance automated red-teaming by shifting optimization from prompt space to executable code space, enabling procedural, repair-capable attack algorithms that may prove more transferable. The public release of the framework constitutes a concrete contribution to reproducibility.

major comments (2)
  1. Abstract: the headline ASR figures (85.5% on Claude-Sonnet-4.5, 95.9% average) are presented without any description of the baseline methods, evaluation protocol, statistical tests, or controls for model-specific effects, so the magnitude and reliability of the claimed gains cannot be assessed from the given text.
  2. Abstract / Experimental Results: the description supplies no indication of a held-out query partition or cross-model zero-shot transfer test for the final synthesized code. Without such safeguards, the reported success rates could arise from query-specific repair logic rather than generalizable attack procedures, directly undermining the central claim that the evolutionary synthesis produces broadly useful algorithms.
minor comments (1)
  1. Abstract: the diversity claim is stated qualitatively ('significantly more diverse'); a brief definition of the diversity metric and quantitative comparison would improve clarity.
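As an illustration of what a quantitative diversity definition could look like, the sketch below scores every pair of texts as one minus the cosine similarity of their bag-of-words vectors and then summarizes the distribution by its median. This is an assumed stand-in: the paper's actual metric is not specified in the text above and may rely on semantic embeddings instead. The function names and sample strings are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from statistics import median


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_diversity(texts):
    """Diversity score (1 - cosine similarity) for every unordered pair."""
    vectors = [Counter(t.lower().split()) for t in texts]
    return [1.0 - cosine_similarity(x, y) for x, y in combinations(vectors, 2)]


# Harmless sample strings, invented for illustration only.
samples = ["red green blue", "red green blue", "cat dog"]
scores = pairwise_diversity(samples)
median_score = median(scores)
print(len(scores))   # 3 pairs from 3 texts
print(median_score)
```

Reporting the full distribution of `scores` for each method, rather than a single adjective, would let a reader compare medians and spreads directly.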

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which highlight opportunities to strengthen the presentation and evaluation of our work. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: Abstract: the headline ASR figures (85.5% on Claude-Sonnet-4.5, 95.9% average) are presented without any description of the baseline methods, evaluation protocol, statistical tests, or controls for model-specific effects, so the magnitude and reliability of the claimed gains cannot be assessed from the given text.

    Authors: We agree that the abstract is too high-level to convey the experimental context. The full manuscript details the baselines (including prompt- and code-based methods from prior work), the evaluation protocol (standard harmful query sets, ASR metric, multiple target models with controls for model-specific behaviors), and comparative analyses. To improve accessibility, we will revise the abstract to briefly note the comparison to existing methods and the multi-model evaluation setting. revision: yes

  2. Referee: Abstract / Experimental Results: the description supplies no indication of a held-out query partition or cross-model zero-shot transfer test for the final synthesized code. Without such safeguards, the reported success rates could arise from query-specific repair logic rather than generalizable attack procedures, directly undermining the central claim that the evolutionary synthesis produces broadly useful algorithms.

    Authors: This is a fair critique of the generalizability evidence. While the self-correction loop and procedural code design aim to produce reusable algorithms, and experiments cover diverse models and queries, the manuscript does not explicitly report held-out query partitions or zero-shot cross-model transfer results for the evolved code. We will add these analyses in the revision to directly support the claim of broadly useful algorithms. revision: yes
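The two safeguards discussed in this exchange amount to bookkeeping that can be fixed before any experiment runs. The sketch below shows one way to do it under stated assumptions: queries are assigned to a held-out split deterministically by hashing their IDs, and zero-shot transfer pairs are those whose target model was never used during development. All names (`is_held_out`, `split_queries`, `zero_shot_pairs`, the model labels, the query IDs) are hypothetical and not taken from the paper or its released framework.

```python
import hashlib


def is_held_out(query_id, held_out_fraction=0.2):
    """Deterministically assign a query to the held-out split by hashing its ID."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < held_out_fraction


def split_queries(query_ids, held_out_fraction=0.2):
    """Return (development, held_out) lists; the split is stable across runs."""
    development, held_out = [], []
    for q in query_ids:
        if is_held_out(q, held_out_fraction):
            held_out.append(q)
        else:
            development.append(q)
    return development, held_out


def zero_shot_pairs(developed_on, targets):
    """(source, target) model pairs where the target was unseen in development."""
    seen = set(developed_on)
    return [(s, t) for s in developed_on for t in targets if t not in seen]


# Placeholder identifiers, invented for illustration only.
query_ids = [f"q{i:03d}" for i in range(100)]
development, held_out = split_queries(query_ids)
pairs = zero_shot_pairs(["model_a"], ["model_a", "model_b"])
print(len(development), len(held_out))
print(pairs)  # [('model_a', 'model_b')]
```

Hashing the ID rather than shuffling keeps a query in the same split even when the query set grows, so results reported on the held-out split stay comparable between runs.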

Circularity Check

0 steps flagged

Empirical results from evolutionary code synthesis framework show no circular derivation

Full rationale

The paper introduces EvoSynth as a multi-agent evolutionary framework for synthesizing executable jailbreak code and reports experimental Attack Success Rates (e.g., 85.5% on Claude-Sonnet-4.5) obtained through direct testing. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps are present; the central claims rest on empirical evaluation rather than any reduction of outputs to inputs by construction. The work is self-contained against external benchmarks via reported ASR metrics and diversity comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The contribution rests on standard assumptions about attack success rate as a metric and the feasibility of code evolution for prompt generation; no free parameters or invented physical entities are introduced.

invented entities (1)
  • EvoSynth multi-agent code-evolution framework · no independent evidence
    purpose: Autonomously engineer and evolve executable attack algorithms with self-correction
    The framework is the primary new artifact introduced to move optimization into code space.

pith-pipeline@v0.9.0 · 5786 in / 1105 out tokens · 49787 ms · 2026-05-21T18:53:59.074984+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    TAME uses a Mixture-of-Experts prompt bank with input-dependent routing and three unsupervised objectives to adaptively defend CLIP against adversarial attacks at inference time, achieving at least 49.1% robustness ga...

Reference graph

Works this paper leans on

142 extracted references · 142 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    Introducing Claude, 2023

    Anthropic. Introducing Claude, 2023. 5

  2. [2]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Con- stitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022. 2, 3

  3. [3]

    Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence,

    Joseph Biden. Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence,

  4. [4]

    A realistic threat model for large language model jailbreaks

    Valentyn Boreiko, Alexander Panfilov, Vaclav V oracek, Matthias Hein, and Jonas Geiping. A realistic threat model for large language model jailbreaks. InNeurIPS, 2024. 1

  5. [5]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries.arXiv preprint arXiv:2310.08419, 2023. 1, 2, 3, 5, 6

  6. [6]

    JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Flo- rian Tramer, et al. Jailbreakbench: An open robustness bench- mark for jailbreaking large language models.arXiv preprint arXiv:2404.01318, 2024. 1

  7. [7]

    When llm meets drl: Advancing jailbreaking efficiency via drl-guided search, 2024

    Xuan Chen, Yuzhou Nie, Wenbo Guo, and Xiangyu Zhang. When llm meets drl: Advancing jailbreaking efficiency via drl-guided search, 2024. 2

  8. [8]

    Extracting training data from unconditional diffusion models

    Yunhao Chen, Shujie Wang, Difan Zou, and Xingjun Ma. Extracting training data from unconditional diffusion models. arXiv preprint arXiv:2410.02467, 2024. 13

  9. [9]

    DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shan- huang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wen- jun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y . K. Li, Wenfeng Liang, Fangyun Lin, A...

  10. [10]

    h4rm3l: A language for composable jailbreak attack synthesis.arXiv preprint arXiv:2408.04811, 2024

    Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, and Christopher D Manning. h4rm3l: A language for composable jailbreak attack synthesis.arXiv preprint arXiv:2408.04811, 2024. 2

  11. [11]

    Llama guard 3-1b-int4: Compact and 9 efficient safeguard for human-ai conversations.arXiv preprint arXiv:2411.17713, 2024

    Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, et al. Llama guard 3-1b-int4: Compact and 9 efficient safeguard for human-ai conversations.arXiv preprint arXiv:2411.17713, 2024. 3

  12. [12]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming lan- guage models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858, 2022. 2

  13. [13]

    Mart: Improving llm safety with multi-round automatic red-teaming,

    Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi- Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. Mart: Improving llm safety with multi-round automatic red-teaming,

  14. [14]

    Simon Geisler, Johannes C. Z. Gane, Alianda Lopez, Corina PġCtraÈ ´Zcanu, Paul-Ambroise Duquenne, Thomas Hofmann, and V olkan Cevher. Attacking large language models with projected gradient descent, 2024. 2

  15. [15]

    K. H. Hung et al. Attention tracker: Detecting prompt injec- tion attacks in llms. InFindings of NAACL, 2025. 3

  16. [16]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 1

  17. [17]

    Codeattack: Code-based adversarial attacks for pre-trained programming language models

    Akshita Jha and Chandan K Reddy. Codeattack: Code-based adversarial attacks for pre-trained programming language models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 14892–14900, 2023. 2, 5, 6

  18. [18]

    Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

    Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024. 2

  19. [19]

    Red queen: Safe- guarding large language models against concealed multi-turn jailbreaking.arXiv preprint arXiv:2409.17458, 2024

    Yifan Jiang, Kriti Aggarwal, Tanmay Laud, Kashif Munir, Jay Pujara, and Subhabrata Mukherjee. Red queen: Safe- guarding large language models against concealed multi-turn jailbreaking.arXiv preprint arXiv:2409.17458, 2024. 5, 6

  20. [20]

    Open sesame! universal black box jailbreaking of large language models

    Raz Lapid, Ron Langberg, and Moshe Sipper. Open sesame! universal black box jailbreaking of large language models. ArXiv, abs/2309.01446, 2023. 2

  21. [21]

    A new generation of perspective api: Efficient multilingual character-level trans- formers

    Alyssa Lees, Vinh Q Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. A new generation of perspective api: Efficient multilingual character-level trans- formers. InProceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3197–3207,

  22. [22]

    Salad-bench: A hierarchical and com- prehensive safety benchmark for large language models

    Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. Salad-bench: A hierarchical and comprehensive safety benchmark for large language models.arXiv preprint arXiv:2402.05044, 2024. 1

  23. [23]

    Llm defenses are not robust to multi-turn human jailbreaks yet.ArXiv, 2024

    Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, and Summer Yue. Llm defenses are not robust to multi-turn human jailbreaks yet.ArXiv, 2024. 2

  24. [24]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451,

  25. [26]

    arXiv preprint arXiv:2410.05295 (2024)

    Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy V orobeychik, Z. Morley Mao, Somesh Jha, Patrick Drew McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms.ArXiv, abs/2410.05295, 2024. 2, 6

  26. [27]

    Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jail- breaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023. 2

  27. [28]

    Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

    Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model safety.arXiv preprint arXiv:2502.05206, 2025. 2

  28. [29]

    A holistic approach to undesired con- tent detection in the real world

    Todor Markov, Chong Zhang, Sandhini Agarwal, Floren- tine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired con- tent detection in the real world. InProceedings of the AAAI Conference on Artificial Intelligence, pages 15009–15018,

  29. [30]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024. 1, 2, 5

  30. [31]

    Tree of attacks: Jailbreaking black-box llms automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119, 2023. 1, 2, 5, 6

  31. [32]

    Introducing meta llama 3, 2024

    Meta. Introducing meta llama 3, 2024. 5

  32. [33]

    Fight back against jailbreaking via prompt adversarial tuning

    Yichuan Mo, Yuji Wang, Zeming Wei, and Yisen Wang. Fight back against jailbreaking via prompt adversarial tuning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 1

  33. [34]

    Nvidia nemo guardrails, 2024

    NVIDIA. Nvidia nemo guardrails, 2024. 3

  34. [35]

    Gpt-4 technical report, 2024

    OpenAI. Gpt-4 technical report, 2024. 2, 5

  35. [36]

    Evo-marl: Co-evolutionary multi-agent reinforcement learning for internalized safety

    Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S Yu, Manling Li, et al. Evo-marl: Co-evolutionary multi-agent reinforcement learning for internalized safety. arXiv preprint arXiv:2508.03864, 2025. 2

  36. [37]

    Red Teaming Language Models with Language Models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Ro- man Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models.arXiv preprint arXiv:2202.03286, 2022. 2

  37. [38]

    Qwen2.5 technical report, 2025

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Jun- yang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao L...

  38. [39]

    X-teaming: Multi- turn jailbreaks and defenses with adaptive multi-agents.arXiv preprint arXiv:2504.13203, 2025

    Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, and Saadia Gabriel. X-teaming: Multi- turn jailbreaks and defenses with adaptive multi-agents.arXiv preprint arXiv:2504.13203, 2025. 1, 3, 5, 6

  39. [40]

    Derail yourself: Multi-turn llm jailbreak attack through self- discovered clues.arXiv preprint arXiv:2410.10700, 2024

    Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. Derail yourself: Multi-turn llm jailbreak attack through self- discovered clues.arXiv preprint arXiv:2410.10700, 2024. 5, 6

  40. [41]

    Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack.ArXiv, abs/2404.01833, 2024. 5, 6

  41. [42]

    Rain- bow teaming: Open-ended generation of diverse adversarial prompts.Advances in Neural Information Processing Systems, 37:69747–69786, 2024

    Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, et al. Rain- bow teaming: Open-ended generation of diverse adversarial prompts.Advances in Neural Information Processing Systems, 37:69747–69786, 2024. 1, 3, 5, 6

  42. [43]

    Adversarial attacks and defenses in large lan- guage models: Old and new threats

    Leo Schwinn, David Dobre, Stephan Günnemann, and Gau- thier Gidel. Adversarial attacks and defenses in large lan- guage models: Old and new threats. InProceedings on, pages 103–117. PMLR, 2023. 1

  43. [44]

    L1B3RT45: Jailbreaks for All Flagship AI Models, 2024

    Pliny the Prompter. L1B3RT45: Jailbreaks for All Flagship AI Models, 2024. 2

  44. [45]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Mar- tinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roz- ière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023. 2

  45. [46]

    Backdooralign: Mitigating fine-tuning based jailbreak attack with backdoor enhanced safety alignment

    Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, and Chaowei Xiao. Backdooralign: Mitigating fine-tuning based jailbreak attack with backdoor enhanced safety alignment. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 1

  46. [47]

    A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment.arXiv preprint arXiv:2504.15585, 2025

    Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Han- jun Luo, et al. A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment.arXiv preprint arXiv:2504.15585, 2025. 1

  47. [48]

    Stand-guard: A small task-adaptive content moderation model.arXiv preprint arXiv:2411.05214, 2024

    Minjia Wang, Pingping Lin, Siqi Cai, Shengnan An, Shengjie Ma, Zeqi Lin, Congrui Huang, and Bixiong Xu. Stand-guard: A small task-adaptive content moderation model.arXiv preprint arXiv:2411.05214, 2024. 1

  48. [49]

    Genbreak: Red teaming text-to-image generators using large language models.arXiv preprint arXiv:2506.10047, 2025

    Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma, and Yu-Gang Jiang. Genbreak: Red teaming text-to-image generators using large language models.arXiv preprint arXiv:2506.10047, 2025. 13

  49. [50]

    Sociotechnical safety evaluation of generative ai systems,

    Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, et al. So- ciotechnical safety evaluation of generative ai systems.arXiv preprint arXiv:2310.11986, 2023. 2

  50. [51]

    Tas- tle: Distract large language models for automatic jailbreak attack.CoRR, 2024

    Zeguan Xiao, Yan Yang, Guanhua Chen, and Yun Chen. Tas- tle: Distract large language models for automatic jailbreak attack.CoRR, 2024. 2

  51. [53]

    Redagent: Red teaming large language models with context-aware au- tonomous language agent.arXiv preprint arXiv:2407.16667,

    Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, and Kui Ren. Redagent: Red teaming large language models with context-aware au- tonomous language agent.arXiv preprint arXiv:2407.16667,

  52. [54]

    Chain of attack: a semantic-driven contextual multi-turn at- tacker for llm.ArXiv, abs/2405.05610, 2024

    Xikang Yang, Xuehai Tang, Songlin Hu, and Jizhong Han. Chain of attack: a semantic-driven contextual multi-turn at- tacker for llm.ArXiv, abs/2405.05610, 2024. 5, 6

  53. [55]

    Reasoning- augmented conversation for multi-turn jailbreak attacks on large language models.arXiv preprint arXiv:2502.11054,

    Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, and Dacheng Tao. Reasoning- augmented conversation for multi-turn jailbreak attacks on large language models.arXiv preprint arXiv:2502.11054,

  54. [56]

    GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

    Jiahao Yu, Xingwei Lin, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.arXiv preprint arXiv:2309.10253, 2023. 2

  55. [57]

    LLM-Virus: Evolutionary jailbreak attack on large language models, 2024

    Miao Yu, Junfeng Fang, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, and Qingsong Wen. LLM-Virus: Evolutionary jailbreak attack on large language models, 2024. 2

  56. [58]

    Word-level textual ad- versarial attacking as combinatorial optimization

    Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. Word-level textual ad- versarial attacking as combinatorial optimization. InPro- ceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6066–6080, Online, 2020. Association for Computational Linguistics. 2

  57. [59]

    Ai risk cate- gorization decoded (air 2024): From government regulations to corporate policies.arXiv preprint arXiv:2406.17864, 2024

    Yi Zeng, Kevin Klyman, Andy Zhou, Yu Yang, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, and Bo Li. Ai risk cate- gorization decoded (air 2024): From government regulations to corporate policies.arXiv preprint arXiv:2406.17864, 2024. 1

  58. [60]

    Air-bench 2024: A safety benchmark based on risk categories from regulations and policies, 2024

    Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, and Bo Li. Air-bench 2024: A safety benchmark based on risk categories from regulations and policies, 2024. 1

  59. [61]

    Jbshield: Defending large language models from jailbreak attacks through activated concept analysis and manipulation

    Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, and Qian Wang. Jbshield: Defending large language models from jailbreak attacks through activated concept analysis and manipulation. arXiv preprint arXiv:2502.07557,

  60. [62]

    Simulating classroom education with llm-empowered agents

    Zheyuan Zhang, Daniel Zhang-Li, Jifan Yu, Linlu Gong, Jinchang Zhou, Zhanxin Hao, Jianxiao Jiang, Jie Cao, Huiqin Liu, Zhiyuan Liu, et al. Simulating classroom education with llm-empowered agents. arXiv preprint arXiv:2406.19226,

  61. [63]

    On prompt-driven safeguarding for large language models

    Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. On prompt-driven safeguarding for large language models. In Forty-first International Conference on Machine Learning, 2024. 1

  62. [64]

    Ali-agent: Assessing llms’ alignment with human values via agent-based evaluation

    Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, and Tat-Seng Chua. Ali-agent: Assessing llms’ alignment with human values via agent-based evaluation. ArXiv, abs/2405.14125, 2024. 2

  63. [65]

    Robust prompt optimization for defending language models against jailbreaking attacks

    Andy Zhou, Bo Li, and Haohan Wang. Robust prompt optimization for defending language models against jailbreaking attacks. arXiv preprint arXiv:2401.17263, 2024. 1

  64. [66]

    Autoredteamer: Autonomous red teaming with lifelong attack integration

    Andy Zhou, Kevin Wu, Francesco Pinto, Zhaorun Chen, Yi Zeng, Yu Yang, Shuang Yang, Sanmi Koyejo, James Zou, and Bo Li. Autoredteamer: Autonomous red teaming with lifelong attack integration. arXiv preprint arXiv:2503.15754,

  65. [67]

    Memento: Fine-tuning llm agents without fine-tuning llms

    Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, et al. Memento: Fine-tuning llm agents without fine-tuning llms. Preprint, 2025. 4

  66. [68]

    Easyjailbreak: A unified framework for jailbreaking large language models

    Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, et al. Easyjailbreak: A unified framework for jailbreaking large language models. arXiv preprint arXiv:2403.12171, 2024. 2

  67. [69]

    Systematic benchmarking of guardrails against prompt injection attacks

    Giovanni Zizzo et al. Systematic benchmarking of guardrails against prompt injection attacks. arXiv preprint arXiv:2502.15427, 2025. 3

  68. [70]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. 1, 2

  69. [71]

    A reinforced generation of adversarial examples for neural machine translation

    Wei Zou, Shujian Huang, Jun Xie, Xinyu Dai, and Jiajun Chen. A reinforced generation of adversarial examples for neural machine translation. arXiv preprint arXiv:1911.03677,

  70. [72]

    chain-of-thought

    A. Experimental Hyperparameters. This section details the key hyperparameters used for all experiments conducted with the EvoSynth framework in Tab. 5. Moreover, the tools we use here could also incorporate more image-based attack tools, allowing for multi-modal methods [8, 49]. Table 5. Core Hyperparameters for the EvoSynth Framework. Parameter Descr...

  71. [73]

    CONCEPT GENERATION: Create new AI attack concepts and detailed scenarios

  72. [74]

    INTELLIGENCE GATHERING: Categorize jailbreak methods into 5 main categories with implementation details

  73. [75]

    CONTEXT ANALYSIS: Access RunContext history to understand existing intelligence

  74. [76]

    When calling your function tools, provide YOUR actual analysis results as parameters

    CATEGORY INNOVATION: Create new jailbreak categories when necessary beyond the 5 main ones CRITICAL ANALYSIS REQUIREMENTS: You MUST perform actual analysis and thinking, not just follow predefined patterns. When calling your function tools, provide YOUR actual analysis results as parameters. Generate diverse, lengthy, and insightful content with detailed ...

  75. [77]

    CONTEXT ANALYSIS: Use access_runcontext_history to understand existing intelligence
    - Review previous concepts and intelligence gathered
    - Identify gaps in current knowledge and approaches
    - Ensure new concepts don’t conflict with existing ones
    - Build upon existing intelligence with new insights

  76. [78]

    CONCEPT GENERATION: Use create_new_ai_concepts to generate innovative attack scenarios
    - Create diverse attack concepts with detailed descriptions
    - Ensure concepts are original and don’t duplicate existing ones
    - Provide detailed scenarios and implementation approaches
    - Generate concepts that can be categorized into jailbreak methods

  77. [79]

    * Injection Attacks: Prompt injection, instruction hijacking, input manipulation

    INTELLIGENCE CATEGORIZATION: Use gather_jailbreak_intelligence to organize implementation methods
    - Categorize approaches into 5 main categories:
      * Injection Attacks: Prompt injection, instruction hijacking, input manipulation
      * Roleplay Attacks: Character-based attacks, persona manipulation, role-playing
      * Structured & Iterative Prompting: Multi-step ...

  78. [80]

    You have completed ALL your required tasks

  79. [81]

    You have used ALL required tools at minimum specified times

  80. [82]

    You have provided YOUR final report
