Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Jie Li; Juncheng Li; Xingjun Ma; Xin Wang; Yan Teng; Yingchun Wang; Yixu Wang; Yunhao Chen

arxiv: 2511.12710 · v2 · pith:LJBYYYBZnew · submitted 2025-11-16 · 💻 cs.CL · cs.CR


Pith reviewed 2026-05-21 18:53 UTC · model grok-4.3


keywords jailbreak attacks · LLM red teaming · evolutionary code synthesis · multi-agent systems · automated attack generation · LLM security · code-level optimization


The pith

Optimizing executable code for jailbreak generation rather than prompt wording produces more effective and diverse attacks on LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that red-teaming methods limited to refining attack prompts miss an opportunity to improve the underlying procedure that creates those prompts. By instead evolving complete executable code that includes logic, branching, and repair steps, the approach can generate attacks that adapt to model feedback during development. A sympathetic reader would care because this shift could produce jailbreaks that transfer better across models and prove harder for static defenses to block, since the attack method itself improves over time rather than just the wording of any single input.

Core claim

The central claim is that moving optimization into code space via a multi-agent system that engineers, evolves, and self-corrects attack algorithms yields higher attack success rates and greater diversity than prompt-space methods. Experiments reported in the paper show the resulting attacks reach an 85.5 percent success rate on Claude-Sonnet-4.5 and average 95.9 percent across tested targets while producing outputs that differ more from one another than those generated by prior techniques.
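To make the headline metric concrete, the following is a minimal sketch of how an attack success rate (ASR) and a Wilson score interval around it could be computed from per-attempt judge verdicts. The function names and the toy verdict list (171 successes in 200 attempts) are hypothetical and chosen only so that the point estimate equals 85.5%; this summary does not describe the paper's actual judge, query set, or sample sizes.

```python
import math

def attack_success_rate(verdicts):
    """Fraction of attempts that a judge marked as successful (True)."""
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    return sum(verdicts) / len(verdicts)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Hypothetical toy data, not taken from the paper.
verdicts = [True] * 171 + [False] * 29
asr = attack_success_rate(verdicts)
low, high = wilson_interval(sum(verdicts), len(verdicts))
print(f"ASR = {asr:.3f}, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

Reporting an interval alongside the point estimate is one way to address the question of how reliable a single ASR figure is for a given number of attempts.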

What carries the argument

The argument rests on the multi-agent system, which autonomously writes, executes, and iteratively rewrites executable code for generating jailbreak prompts in response to target-model feedback and failed attempts.

If this is right

Attack generation can optimize execution flow, reusable components, and failure-driven repair in addition to the final prompt text.
The resulting attacks transfer across multiple evaluated LLM targets at high success rates.
Generated attacks exhibit greater diversity than those produced by prompt-optimization baselines.
Releasing the framework supports further work on evolutionary methods that operate directly in executable code space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Defenders may eventually need to analyze or restrict dynamic code patterns rather than only static prompt patterns.
The same evolutionary approach could be tested on other optimization tasks such as generating adversarial examples in non-language domains.
Success may hinge on the initial agent capabilities or seed code quality, pointing to a possible direction for improving the multi-agent setup itself.

Load-bearing premise

The self-correction process must reliably produce code that generalizes to new models and stays effective as defenses change.

What would settle it

If the evolved code-based attacks achieve substantially lower success rates than the reported averages on a large language model released after the experiments, that would indicate the method does not produce reliably generalizable algorithms.

Figures

Figures reproduced from arXiv: 2511.12710 by Jie Li, Juncheng Li, Xingjun Ma, Xin Wang, Yan Teng, Yingchun Wang, Yixu Wang, Yunhao Chen.

**Figure 1.** An overview of our proposed EvoSynth method. The process begins with the Reconnaissance Agent formulating a strategy. The …

**Figure 2.** Diversity Comparison of Generated Attack Prompts. The raincloud plot shows the distribution of pairwise diversity scores for prompts from the X-Teaming dataset and those generated by EvoSynth. The wider distribution and higher median score for EvoSynth indicate that our framework synthesizes a more semantically diverse and non-redundant set of attacks.
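As an illustration of what a distribution of pairwise diversity scores can look like in code, here is a minimal sketch. The caption does not define the metric, so this sketch assumes a stand-in: one minus the cosine similarity of bag-of-words vectors. A semantic comparison would more plausibly use sentence embeddings. All names and the placeholder strings are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import median

def bag_of_words(text):
    """Lower-cased whitespace token counts (a crude stand-in for an embedding)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity of two count vectors, clamped at 0 against rounding."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return max(0.0, 1.0 - dot / (norm_a * norm_b))

def pairwise_diversity(texts):
    """Distance for every unordered pair of texts."""
    vectors = [bag_of_words(t) for t in texts]
    return [cosine_distance(x, y) for x, y in combinations(vectors, 2)]

# Placeholder strings only; these are not prompts from any dataset.
redundant = ["alpha beta gamma", "alpha beta gamma", "alpha beta delta"]
varied = ["alpha beta gamma", "delta epsilon zeta", "eta theta iota"]
print(median(pairwise_diversity(redundant)), median(pairwise_diversity(varied)))
```

Comparing the medians (or the full distributions) of the two score lists is the kind of comparison the raincloud plot summarizes.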

**Figure 3.** Cumulative Convergence of Attack Success. The plots show the cumulative percentage of sessions that have achieved their highest score by a given point in time. (Left) Convergence by the tool's code evolution iteration number. (Right) Convergence by the total number of agent actions taken in the session. Both plots demonstrate rapid convergence, with the majority of optimal attacks being discovered early in…
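The quantity plotted in Figure 3 can be sketched as follows, assuming each session is recorded as a list of judge scores, one per iteration. The function names and the toy score lists are hypothetical; the paper's logging format is not given here.

```python
def best_iteration(scores):
    """1-based index of the first iteration that reaches the session's top score."""
    return scores.index(max(scores)) + 1

def cumulative_convergence(sessions):
    """Percent of sessions whose top score was reached by iteration k, for k = 1..K."""
    firsts = [best_iteration(s) for s in sessions]
    horizon = max(firsts)
    total = len(firsts)
    return [100.0 * sum(f <= k for f in firsts) / total
            for k in range(1, horizon + 1)]

# Hypothetical per-iteration scores for four sessions.
sessions = [[3, 5, 5], [5, 4, 2], [1, 2, 5, 5], [2, 5]]
curve = cumulative_convergence(sessions)
print(curve)  # [25.0, 75.0, 100.0]
```

A curve that rises steeply at small k corresponds to the "rapid convergence" the caption describes. The same function applies to the right-hand panel if the lists are indexed by agent action instead of code-evolution iteration.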

**Figure 4.** Cumulative Distribution of Attack Algorithm Transferability. The plot shows the cumulative percentage of all synthesized attack algorithms (y-axis) that meet or exceed a given usage percentage (x-axis). The curve demonstrates that while many algorithms are specialized, a significant portion are highly transferable, with 20% of all algorithms being effective enough to be used on over 80%.
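The curve in Figure 4 is a "meets or exceeds" distribution: for each usage threshold on the x-axis, it gives the share of algorithms at or above that threshold. A minimal sketch, assuming each algorithm has a single usage percentage; the function name and the toy values are hypothetical.

```python
def share_meeting_threshold(usage_pcts, threshold):
    """Percent of algorithms whose usage percentage is at least `threshold`."""
    if not usage_pcts:
        raise ValueError("usage_pcts must be non-empty")
    return 100.0 * sum(u >= threshold for u in usage_pcts) / len(usage_pcts)

# Hypothetical usage percentages for ten algorithms (not the paper's data).
usage = [5, 10, 10, 20, 35, 50, 60, 85, 90, 100]
curve = {t: share_meeting_threshold(usage, t) for t in range(0, 101, 20)}
for threshold, share in curve.items():
    print(f">= {threshold:3d}% usage: {share:5.1f}% of algorithms")
```

By construction the curve is non-increasing in the threshold, so a long right tail indicates a subset of broadly reusable algorithms.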

Original abstract

Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet many still formulate attack optimization primarily in the prompt space. In other words, these methods mainly search for better attack wording or better strategy choices, but they do not search over executable code. By moving the search into code space, we can optimize not only the final attack prompt, but also the procedure that generates it, including execution flow, reusable logic, branching, and failure-driven repair. To overcome this gap, we introduce EvoSynth, an autonomous multi-agent framework that shifts the optimization space from prompts to executable code. Instead of refining prompts directly, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite the code-based algorithm in response to target-model feedback and failed attempts. Through extensive experiments, we demonstrate that EvoSynth achieves an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5 and a 95.9% average ASR across evaluated targets, while generating attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research on evolutionary synthesis in executable code space.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces EvoSynth, an autonomous multi-agent framework that evolves executable code for generating jailbreak attacks on LLMs rather than optimizing prompts directly. It incorporates a code-level self-correction loop that rewrites attack algorithms based on target-model feedback. The central empirical claims are an 85.5% attack success rate (ASR) on Claude-Sonnet-4.5 and a 95.9% average ASR across evaluated targets, together with significantly higher attack diversity than prior methods. The framework is released to support further research.

Significance. If the performance and diversity claims are supported by rigorous controls, this work would advance automated red-teaming by shifting optimization from prompt space to executable code space, enabling procedural, repair-capable attack algorithms that may prove more transferable. The public release of the framework constitutes a concrete contribution to reproducibility.

major comments (2)

Abstract: the headline ASR figures (85.5% on Claude-Sonnet-4.5, 95.9% average) are presented without any description of the baseline methods, evaluation protocol, statistical tests, or controls for model-specific effects, so the magnitude and reliability of the claimed gains cannot be assessed from the given text.
Abstract / Experimental Results: the description supplies no indication of a held-out query partition or cross-model zero-shot transfer test for the final synthesized code. Without such safeguards, the reported success rates could arise from query-specific repair logic rather than generalizable attack procedures, directly undermining the central claim that the evolutionary synthesis produces broadly useful algorithms.
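The held-out safeguard requested above can be sketched as a deterministic partition of query identifiers, so that algorithm evolution sees only the development set and the frozen result is scored only on the held-out set. This is an illustrative assumption about how such a protocol could be set up, not the authors' procedure; the salt, the 20% fraction, and the identifiers are hypothetical.

```python
import hashlib

def is_held_out(query_id, held_out_fraction=0.2, salt="split-v1"):
    """Stable assignment: hash the id to [0, 1) and compare with the fraction."""
    digest = hashlib.sha256(f"{salt}:{query_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return bucket < held_out_fraction

def partition(query_ids, held_out_fraction=0.2):
    """Split ids into (development, held_out) lists without overlap."""
    development, held_out = [], []
    for query_id in query_ids:
        target = held_out if is_held_out(query_id, held_out_fraction) else development
        target.append(query_id)
    return development, held_out

# Hypothetical identifiers standing in for benchmark queries.
query_ids = [f"q{i:03d}" for i in range(200)]
development, held_out = partition(query_ids)
print(len(development), len(held_out))
```

Hash-based assignment keeps the split reproducible across runs and machines, which makes it harder for query-specific repair logic to leak into the reported numbers unnoticed.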

minor comments (1)

Abstract: the diversity claim is stated qualitatively ('significantly more diverse'); a brief definition of the diversity metric and quantitative comparison would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which highlight opportunities to strengthen the presentation and evaluation of our work. We address each major comment below and outline the revisions we will make.


Referee: Abstract: the headline ASR figures (85.5% on Claude-Sonnet-4.5, 95.9% average) are presented without any description of the baseline methods, evaluation protocol, statistical tests, or controls for model-specific effects, so the magnitude and reliability of the claimed gains cannot be assessed from the given text.

Authors: We agree that the abstract is too high-level to convey the experimental context. The full manuscript details the baselines (including prompt- and code-based methods from prior work), the evaluation protocol (standard harmful query sets, ASR metric, multiple target models with controls for model-specific behaviors), and comparative analyses. To improve accessibility, we will revise the abstract to briefly note the comparison to existing methods and the multi-model evaluation setting.
revision: yes

Referee: Abstract / Experimental Results: the description supplies no indication of a held-out query partition or cross-model zero-shot transfer test for the final synthesized code. Without such safeguards, the reported success rates could arise from query-specific repair logic rather than generalizable attack procedures, directly undermining the central claim that the evolutionary synthesis produces broadly useful algorithms.

Authors: This is a fair critique of the generalizability evidence. While the self-correction loop and procedural code design aim to produce reusable algorithms, and experiments cover diverse models and queries, the manuscript does not explicitly report held-out query partitions or zero-shot cross-model transfer results for the evolved code. We will add these analyses in the revision to directly support the claim of broadly useful algorithms.
revision: yes

Circularity Check

0 steps flagged

Empirical results from evolutionary code synthesis framework show no circular derivation


The paper introduces EvoSynth as a multi-agent evolutionary framework for synthesizing executable jailbreak code and reports experimental Attack Success Rates (e.g., 85.5% on Claude-Sonnet-4.5) obtained through direct testing. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps are present; the central claims rest on empirical evaluation rather than any reduction of outputs to inputs by construction. The work is self-contained against external benchmarks via reported ASR metrics and diversity comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The contribution rests on standard assumptions about attack success rate as a metric and the feasibility of code evolution for prompt generation; no free parameters or invented physical entities are introduced.

invented entities (1)

EvoSynth multi-agent code-evolution framework · no independent evidence
Purpose: Autonomously engineer and evolve executable attack algorithms with self-correction.
The framework is the primary new artifact introduced to move optimization into code space.

pith-pipeline@v0.9.0 · 5786 in / 1105 out tokens · 49787 ms · 2026-05-21T18:53:59.074984+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute code-based attack algorithms... code-level self-correction loop

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat ≃ Nat recovery
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We model the sequential decision-making process of EvoSynth as a structured trajectory generation task... soft Q-learning Bellman backup

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

TAME uses a Mixture-of-Experts prompt bank with input-dependent routing and three unsupervised objectives to adaptively defend CLIP against adversarial attacks at inference time, achieving at least 49.1% robustness ga...

Reference graph

Works this paper leans on

142 extracted references · 142 canonical work pages · cited by 1 Pith paper · 14 internal anchors

[1]

Introducing Claude, 2023

Anthropic. Introducing Claude, 2023. 5

work page 2023
[2]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Con- stitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence,

Joseph Biden. Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence,

work page
[4]

A realistic threat model for large language model jailbreaks

Valentyn Boreiko, Alexander Panfilov, Vaclav V oracek, Matthias Hein, and Jonas Geiping. A realistic threat model for large language model jailbreaks. InNeurIPS, 2024. 1

work page 2024
[5]

Jailbreaking Black Box Large Language Models in Twenty Queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries.arXiv preprint arXiv:2310.08419, 2023. 1, 2, 3, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Flo- rian Tramer, et al. Jailbreakbench: An open robustness bench- mark for jailbreaking large language models.arXiv preprint arXiv:2404.01318, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

When llm meets drl: Advancing jailbreaking efficiency via drl-guided search, 2024

Xuan Chen, Yuzhou Nie, Wenbo Guo, and Xiangyu Zhang. When llm meets drl: Advancing jailbreaking efficiency via drl-guided search, 2024. 2

work page 2024
[8]

Extracting training data from unconditional diffusion models

Yunhao Chen, Shujie Wang, Difan Zou, and Xingjun Ma. Extracting training data from unconditional diffusion models. arXiv preprint arXiv:2410.02467, 2024. 13

work page arXiv 2024
[9]

DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shan- huang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wen- jun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y . K. Li, Wenfeng Liang, Fangyun Lin, A...

work page 2024
[10]

h4rm3l: A language for composable jailbreak attack synthesis.arXiv preprint arXiv:2408.04811, 2024

Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, and Christopher D Manning. h4rm3l: A language for composable jailbreak attack synthesis.arXiv preprint arXiv:2408.04811, 2024. 2

work page arXiv 2024
[11]

Llama guard 3-1b-int4: Compact and 9 efficient safeguard for human-ai conversations.arXiv preprint arXiv:2411.17713, 2024

Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, et al. Llama guard 3-1b-int4: Compact and 9 efficient safeguard for human-ai conversations.arXiv preprint arXiv:2411.17713, 2024. 3

work page arXiv 2024
[12]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming lan- guage models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Mart: Improving llm safety with multi-round automatic red-teaming,

Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi- Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. Mart: Improving llm safety with multi-round automatic red-teaming,

work page
[14]

Simon Geisler, Johannes C. Z. Gane, Alianda Lopez, Corina PÄˇCtraÈ ´Zcanu, Paul-Ambroise Duquenne, Thomas Hofmann, and V olkan Cevher. Attacking large language models with projected gradient descent, 2024. 2

work page 2024
[15]

K. H. Hung et al. Attention tracker: Detecting prompt injec- tion attacks in llms. InFindings of NAACL, 2025. 3

work page 2025
[16]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Codeattack: Code-based adversarial attacks for pre-trained programming language models

Akshita Jha and Chandan K Reddy. Codeattack: Code-based adversarial attacks for pre-trained programming language models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 14892–14900, 2023. 2, 5, 6

work page 2023
[18]

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024. 2

work page 2024
[19]

Red queen: Safe- guarding large language models against concealed multi-turn jailbreaking.arXiv preprint arXiv:2409.17458, 2024

Yifan Jiang, Kriti Aggarwal, Tanmay Laud, Kashif Munir, Jay Pujara, and Subhabrata Mukherjee. Red queen: Safe- guarding large language models against concealed multi-turn jailbreaking.arXiv preprint arXiv:2409.17458, 2024. 5, 6

work page arXiv 2024
[20]

Open sesame! universal black box jailbreaking of large language models

Raz Lapid, Ron Langberg, and Moshe Sipper. Open sesame! universal black box jailbreaking of large language models. ArXiv, abs/2309.01446, 2023. 2

work page arXiv 2023
[21]

A new generation of perspective api: Efficient multilingual character-level trans- formers

Alyssa Lees, Vinh Q Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. A new generation of perspective api: Efficient multilingual character-level trans- formers. InProceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3197–3207,

work page
[22]

Salad-bench: A hierarchical and com- prehensive safety benchmark for large language models

Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. Salad-bench: A hierarchical and comprehensive safety benchmark for large language models.arXiv preprint arXiv:2402.05044, 2024. 1

work page arXiv 2024
[23]

Llm defenses are not robust to multi-turn human jailbreaks yet.ArXiv, 2024

Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, and Summer Yue. Llm defenses are not robust to multi-turn human jailbreaks yet.ArXiv, 2024. 2

work page 2024
[24]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

arXiv preprint arXiv:2410.05295 (2024)

Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy V orobeychik, Z. Morley Mao, Somesh Jha, Patrick Drew McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms.ArXiv, abs/2410.05295, 2024. 2, 6

work page arXiv 2024
[27]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jail- breaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model safety.arXiv preprint arXiv:2502.05206, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

A holistic approach to undesired con- tent detection in the real world

Todor Markov, Chong Zhang, Sandhini Agarwal, Floren- tine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired con- tent detection in the real world. InProceedings of the AAAI Conference on Artificial Intelligence, pages 15009–15018,

work page
[30]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024. 1, 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Tree of attacks: Jailbreaking black-box llms automatically

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119, 2023. 1, 2, 5, 6

work page arXiv 2023
[32]

Introducing meta llama 3, 2024

Meta. Introducing meta llama 3, 2024. 5

work page 2024
[33]

Fight back against jailbreaking via prompt adversarial tuning

Yichuan Mo, Yuji Wang, Zeming Wei, and Yisen Wang. Fight back against jailbreaking via prompt adversarial tuning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 1

work page 2024
[34]

Nvidia nemo guardrails, 2024

NVIDIA. Nvidia nemo guardrails, 2024. 3

work page 2024
[35]

Gpt-4 technical report, 2024

OpenAI. Gpt-4 technical report, 2024. 2, 5

work page 2024
[36]

Evo-marl: Co-evolutionary multi-agent reinforcement learning for internalized safety

Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S Yu, Manling Li, et al. Evo-marl: Co-evolutionary multi-agent reinforcement learning for internalized safety. arXiv preprint arXiv:2508.03864, 2025. 2

work page arXiv 2025
[37]

Red Teaming Language Models with Language Models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Ro- man Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models.arXiv preprint arXiv:2202.03286, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[38]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Jun- yang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao L...

work page 2025
[39]

X-teaming: Multi- turn jailbreaks and defenses with adaptive multi-agents.arXiv preprint arXiv:2504.13203, 2025

Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, and Saadia Gabriel. X-teaming: Multi- turn jailbreaks and defenses with adaptive multi-agents.arXiv preprint arXiv:2504.13203, 2025. 1, 3, 5, 6

work page arXiv 2025
[40]

Derail yourself: Multi-turn llm jailbreak attack through self- discovered clues.arXiv preprint arXiv:2410.10700, 2024

Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. Derail yourself: Multi-turn llm jailbreak attack through self- discovered clues.arXiv preprint arXiv:2410.10700, 2024. 5, 6

work page arXiv 2024
[41]

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack.ArXiv, abs/2404.01833, 2024. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Rain- bow teaming: Open-ended generation of diverse adversarial prompts.Advances in Neural Information Processing Systems, 37:69747–69786, 2024

Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, et al. Rain- bow teaming: Open-ended generation of diverse adversarial prompts.Advances in Neural Information Processing Systems, 37:69747–69786, 2024. 1, 3, 5, 6

work page 2024
[43]

Adversarial attacks and defenses in large lan- guage models: Old and new threats

Leo Schwinn, David Dobre, Stephan Günnemann, and Gau- thier Gidel. Adversarial attacks and defenses in large lan- guage models: Old and new threats. InProceedings on, pages 103–117. PMLR, 2023. 1

work page 2023
[44]

L1B3RT45: Jailbreaks for All Flagship AI Models, 2024

Pliny the Prompter. L1B3RT45: Jailbreaks for All Flagship AI Models, 2024. 2

work page 2024
[45]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Mar- tinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roz- ière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Backdooralign: Mitigating fine-tuning based jailbreak attack with backdoor enhanced safety alignment

Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, and Chaowei Xiao. Backdooralign: Mitigating fine-tuning based jailbreak attack with backdoor enhanced safety alignment. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 1

work page 2024
[47]

A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment.arXiv preprint arXiv:2504.15585, 2025

Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Han- jun Luo, et al. A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment.arXiv preprint arXiv:2504.15585, 2025. 1

work page arXiv 2025
[48]

Stand-guard: A small task-adaptive content moderation model.arXiv preprint arXiv:2411.05214, 2024

Minjia Wang, Pingping Lin, Siqi Cai, Shengnan An, Shengjie Ma, Zeqi Lin, Congrui Huang, and Bixiong Xu. Stand-guard: A small task-adaptive content moderation model.arXiv preprint arXiv:2411.05214, 2024. 1

work page arXiv 2024
[49]

Genbreak: Red teaming text-to-image generators using large language models.arXiv preprint arXiv:2506.10047, 2025

Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma, and Yu-Gang Jiang. Genbreak: Red teaming text-to-image generators using large language models.arXiv preprint arXiv:2506.10047, 2025. 13

work page arXiv 2025
[50]

Sociotechnical safety evaluation of generative ai systems,

Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, et al. So- ciotechnical safety evaluation of generative ai systems.arXiv preprint arXiv:2310.11986, 2023. 2

work page arXiv 2023
[51]

Tas- tle: Distract large language models for automatic jailbreak attack.CoRR, 2024

Zeguan Xiao, Yan Yang, Guanhua Chen, and Yun Chen. Tas- tle: Distract large language models for automatic jailbreak attack.CoRR, 2024. 2

work page 2024
[53]

Redagent: Red teaming large language models with context-aware au- tonomous language agent.arXiv preprint arXiv:2407.16667,

Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, and Kui Ren. Redagent: Red teaming large language models with context-aware au- tonomous language agent.arXiv preprint arXiv:2407.16667,

work page arXiv
[54]

Chain of attack: a semantic-driven contextual multi-turn at- tacker for llm.ArXiv, abs/2405.05610, 2024

Xikang Yang, Xuehai Tang, Songlin Hu, and Jizhong Han. Chain of attack: a semantic-driven contextual multi-turn at- tacker for llm.ArXiv, abs/2405.05610, 2024. 5, 6

work page arXiv 2024
[55]

Reasoning- augmented conversation for multi-turn jailbreak attacks on large language models.arXiv preprint arXiv:2502.11054,

Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, and Dacheng Tao. Reasoning- augmented conversation for multi-turn jailbreak attacks on large language models.arXiv preprint arXiv:2502.11054,

work page arXiv
[56]

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Jiahao Yu, Xingwei Lin, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.arXiv preprint arXiv:2309.10253, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

LLM-Virus: Evolutionary jailbreak attack on large language models, 2024

Miao Yu, Junfeng Fang, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, and Qingsong Wen. LLM-Virus: Evolutionary jailbreak attack on large language models, 2024.

[58]

Word-level textual adversarial attacking as combinatorial optimization

Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. Word-level textual adversarial attacking as combinatorial optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6066–6080, Online, 2020. Association for Computational Linguistics.

[59]

Ai risk categorization decoded (air 2024): From government regulations to corporate policies. arXiv preprint arXiv:2406.17864, 2024

Yi Zeng, Kevin Klyman, Andy Zhou, Yu Yang, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, and Bo Li. Ai risk categorization decoded (air 2024): From government regulations to corporate policies. arXiv preprint arXiv:2406.17864, 2024.

[60]

Air-bench 2024: A safety benchmark based on risk categories from regulations and policies, 2024

Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, and Bo Li. Air-bench 2024: A safety benchmark based on risk categories from regulations and policies, 2024.

[61]

Jbshield: Defending large language models from jailbreak attacks through activated concept analysis and manipulation. arXiv preprint arXiv:2502.07557,

Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, and Qian Wang. Jbshield: Defending large language models from jailbreak attacks through activated concept analysis and manipulation. arXiv preprint arXiv:2502.07557,

[62]

Simulating classroom education with llm-empowered agents. arXiv preprint arXiv:2406.19226,

Zheyuan Zhang, Daniel Zhang-Li, Jifan Yu, Linlu Gong, Jinchang Zhou, Zhanxin Hao, Jianxiao Jiang, Jie Cao, Huiqin Liu, Zhiyuan Liu, et al. Simulating classroom education with llm-empowered agents. arXiv preprint arXiv:2406.19226,

[63]

On prompt-driven safeguarding for large language models

Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. On prompt-driven safeguarding for large language models. In Forty-first International Conference on Machine Learning, 2024.

[64]

Ali-agent: Assessing llms’ alignment with human values via agent-based evaluation. ArXiv, abs/2405.14125, 2024

Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, and Tat-Seng Chua. Ali-agent: Assessing llms’ alignment with human values via agent-based evaluation. ArXiv, abs/2405.14125, 2024.

[65]

Robust prompt optimization for defending language models against jailbreaking attacks. arXiv preprint arXiv:2401.17263, 2024

Andy Zhou, Bo Li, and Haohan Wang. Robust prompt optimization for defending language models against jailbreaking attacks. arXiv preprint arXiv:2401.17263, 2024.

[66]

Autoredteamer: Autonomous red teaming with lifelong attack integration. arXiv preprint arXiv:2503.15754,

Andy Zhou, Kevin Wu, Francesco Pinto, Zhaorun Chen, Yi Zeng, Yu Yang, Shuang Yang, Sanmi Koyejo, James Zou, and Bo Li. Autoredteamer: Autonomous red teaming with lifelong attack integration. arXiv preprint arXiv:2503.15754,

[67]

Memento: Fine-tuning llm agents without fine-tuning llms. Preprint, 2025

Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, et al. Memento: Fine-tuning llm agents without fine-tuning llms. Preprint, 2025.

[68]

Easyjailbreak: A unified framework for jailbreaking large language models

Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, et al. Easyjailbreak: A unified framework for jailbreaking large language models. arXiv preprint arXiv:2403.12171, 2024.

[69]

Systematic benchmarking of guardrails against prompt injection attacks.arXiv preprint arXiv:2502.15427, 2025

Giovanni Zizzo et al. Systematic benchmarking of guardrails against prompt injection attacks. arXiv preprint arXiv:2502.15427, 2025.

[70]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.

[71]

A reinforced generation of adversarial examples for neural machine translation. arXiv preprint arXiv:1911.03677,

Wei Zou, Shujian Huang, Jun Xie, Xinyu Dai, and Jiajun Chen. A reinforced generation of adversarial examples for neural machine translation. arXiv preprint arXiv:1911.03677,

[72]

chain-of-thought

A. Experimental Hyperparameters

This section details the key hyperparameters used for all experiments conducted with the EvoSynth framework (Tab. 5). Moreover, the tools we use here could also incorporate more image-based attack tools, allowing for multi-modal methods [8, 49].

Table 5. Core Hyperparameters for the EvoSynth Framework. Parameter Descr...

[73]

CONCEPT GENERATION: Create new AI attack concepts and detailed scenarios

[74]

INTELLIGENCE GATHERING: Categorize jailbreak methods into 5 main categories with implementation details

[75]

CONTEXT ANALYSIS: Access RunContext history to understand existing intelligence

[76]

When calling your function tools, provide YOUR actual analysis results as parameters

CATEGORY INNOVATION: Create new jailbreak categories when necessary beyond the 5 main ones CRITICAL ANALYSIS REQUIREMENTS: You MUST perform actual analysis and thinking, not just follow predefined patterns. When calling your function tools, provide YOUR actual analysis results as parameters. Generate diverse, lengthy, and insightful content with detailed ...

[77]

CONTEXT ANALYSIS: Use access_runcontext_history to understand existing intelligence
- Review previous concepts and intelligence gathered
- Identify gaps in current knowledge and approaches
- Ensure new concepts don’t conflict with existing ones
- Build upon existing intelligence with new insights

[78]

CONCEPT GENERATION: Use create_new_ai_concepts to generate innovative attack scenarios
- Create diverse attack concepts with detailed descriptions
- Ensure concepts are original and don’t duplicate existing ones
- Provide detailed scenarios and implementation approaches
- Generate concepts that can be categorized into jailbreak methods

[79]

* Injection Attacks: Prompt injection, instruction hijacking, input manipulation

INTELLIGENCE CATEGORIZATION: Use gather_jailbreak_intelligence to organize implementation methods
- Categorize approaches into 5 main categories:
  * Injection Attacks: Prompt injection, instruction hijacking, input manipulation
  * Roleplay Attacks: Character-based attacks, persona manipulation, role-playing
  * Structured & Iterative Prompting: Multi-step ...

[80]

You have completed ALL your required tasks

[81]

You have used ALL required tools at minimum specified times

[82]

You have provided YOUR final report

