Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
Pith reviewed 2026-05-21 18:53 UTC · model grok-4.3
The pith
Optimizing executable code for jailbreak generation rather than prompt wording produces more effective and diverse attacks on LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that moving optimization into code space via a multi-agent system that engineers, evolves, and self-corrects attack algorithms yields higher attack success rates and greater diversity than prompt-space methods. Experiments reported in the paper show the resulting attacks reach an 85.5 percent success rate on Claude-Sonnet-4.5 and average 95.9 percent across tested targets while producing outputs that differ more from one another than those generated by prior techniques.
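Both headline numbers are rates over judged attack attempts. A minimal sketch of how such figures are commonly computed follows, assuming each attempt is logged as a (target, success) pair and that the "average" is an unweighted mean over targets. The abstract does not state the paper's exact aggregation, and the record layout, function names, and model names here are hypothetical.

```python
from collections import defaultdict


def asr_by_target(records):
    """Return {target: fraction of attempts judged successful}.

    `records` is an iterable of (target_name, success) pairs. This layout
    is an assumption for illustration, not the paper's log format.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for target, success in records:
        totals[target] += 1
        wins[target] += int(bool(success))
    return {t: wins[t] / totals[t] for t in totals}


def macro_average(per_target):
    """Unweighted mean over targets, so each model counts equally."""
    if not per_target:
        raise ValueError("no targets to average")
    return sum(per_target.values()) / len(per_target)


# Toy data with placeholder model names.
records = [
    ("model_a", True), ("model_a", False),
    ("model_b", True), ("model_b", True),
]
per_target = asr_by_target(records)
print(per_target, macro_average(per_target))
```

Reporting the per-target rates next to the average makes it visible when one easy target inflates the mean.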
What carries the argument
The multi-agent system that autonomously writes, executes, and iteratively rewrites executable code for generating jailbreak prompts in response to target-model feedback and failed attempts.
If this is right
- Attack generation can optimize execution flow, reusable components, and failure-driven repair in addition to the final prompt text.
- The resulting attacks transfer across multiple evaluated LLM targets at high success rates.
- Generated attacks exhibit greater diversity than those produced by prompt-optimization baselines.
- Releasing the framework supports further work on evolutionary methods that operate directly in executable code space.
Where Pith is reading between the lines
- Defenders may eventually need to analyze or restrict dynamic code patterns rather than only static prompt patterns.
- The same evolutionary approach could be tested on other optimization tasks such as generating adversarial examples in non-language domains.
- Success may hinge on the initial agent capabilities or seed code quality, pointing to a possible direction for improving the multi-agent setup itself.
Load-bearing premise
That the self-correction process will reliably produce code that generalizes to new models and stays effective as defenses change.
What would settle it
If the evolved code-based attacks achieve substantially lower success rates than the reported averages on a large language model released after the experiments, that would indicate the method does not produce reliably generalizable algorithms.
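One way to make "substantially lower" concrete is to ask whether a confidence interval for the success rate on the new model lies entirely below the reported figure. The sketch below uses the standard Wilson score interval. The counts are hypothetical, and treating the reported 95.9% as a fixed reference value (ignoring its own sampling uncertainty) is a simplifying assumption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


def clearly_below(successes, trials, reported_rate):
    """True when the whole interval sits under the reported rate."""
    _, upper = wilson_interval(successes, trials)
    return upper < reported_rate


# Hypothetical counts for a model released after the paper's experiments.
print(clearly_below(60, 100, 0.959))
```

A result that is lower but whose interval still overlaps the reported rate would not settle the question either way.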
Original abstract
Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet many still formulate attack optimization primarily in the prompt space. In other words, these methods mainly search for better attack wording or better strategy choices, but they do not search over executable code. By moving the search into code space, we can optimize not only the final attack prompt, but also the procedure that generates it, including execution flow, reusable logic, branching, and failure-driven repair. To overcome this gap, we introduce EvoSynth, an autonomous multi-agent framework that shifts the optimization space from prompts to executable code. Instead of refining prompts directly, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite the code-based algorithm in response to target-model feedback and failed attempts. Through extensive experiments, we demonstrate that EvoSynth achieves an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5 and a 95.9% average ASR across evaluated targets, while generating attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research on evolutionary synthesis in executable code space.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EvoSynth, an autonomous multi-agent framework that evolves executable code for generating jailbreak attacks on LLMs rather than optimizing prompts directly. It incorporates a code-level self-correction loop that rewrites attack algorithms based on target-model feedback. The central empirical claims are an 85.5% attack success rate (ASR) on Claude-Sonnet-4.5 and a 95.9% average ASR across evaluated targets, together with significantly higher attack diversity than prior methods. The framework is released to support further research.
Significance. If the performance and diversity claims are supported by rigorous controls, this work would advance automated red-teaming by shifting optimization from prompt space to executable code space, enabling procedural, repair-capable attack algorithms that may prove more transferable. The public release of the framework constitutes a concrete contribution to reproducibility.
major comments (2)
- Abstract: the headline ASR figures (85.5% on Claude-Sonnet-4.5, 95.9% average) are presented without any description of the baseline methods, evaluation protocol, statistical tests, or controls for model-specific effects, so the magnitude and reliability of the claimed gains cannot be assessed from the given text.
- Abstract / Experimental Results: the description supplies no indication of a held-out query partition or cross-model zero-shot transfer test for the final synthesized code. Without such safeguards, the reported success rates could arise from query-specific repair logic rather than generalizable attack procedures, directly undermining the central claim that the evolutionary synthesis produces broadly useful algorithms.
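The held-out safeguard requested in the second comment can be sketched as a deterministic split of query IDs, followed by a comparison of success rates on the two partitions. This is an illustrative evaluation harness only. The function names, the 30% fraction, and the outcome dictionary are assumptions, and nothing here reflects how the paper actually partitioned its queries.

```python
import random


def split_queries(query_ids, held_out_fraction=0.3, seed=0):
    """Deterministically split query IDs into (evolution, held_out) lists."""
    if not 0 < held_out_fraction < 1:
        raise ValueError("held_out_fraction must be strictly between 0 and 1")
    ids = sorted(set(query_ids))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(ids)
    cut = int(round(len(ids) * held_out_fraction))
    return sorted(ids[cut:]), sorted(ids[:cut])


def success_rate(outcomes, query_ids):
    """Fraction of `query_ids` whose judged outcome is truthy."""
    ids = list(query_ids)
    if not ids:
        raise ValueError("empty partition")
    return sum(bool(outcomes[q]) for q in ids) / len(ids)


query_ids = [f"q{i:02d}" for i in range(10)]
evolution, held_out = split_queries(query_ids)

# Placeholder outcomes modelling the worst case the referee describes:
# success only on queries seen during evolution.
outcomes = {q: (q in evolution) for q in query_ids}
gap = success_rate(outcomes, evolution) - success_rate(outcomes, held_out)
print(len(evolution), len(held_out), gap)
```

A large gap between the two partitions would point to query-specific repair logic, while a small gap would support the generalization claim.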
minor comments (1)
- Abstract: the diversity claim is stated qualitatively ('significantly more diverse'); a brief definition of the diversity metric and quantitative comparison would improve clarity.
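As an example of what a quantitative definition could look like, the sketch below computes mean pairwise Jaccard distance over whitespace tokens. This is one common lexical diversity measure chosen purely for illustration. The paper's actual metric is not given in the abstract, and the sample strings are neutral placeholders.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over lowercase whitespace tokens."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0  # two empty texts are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def mean_pairwise_distance(texts):
    """Average Jaccard distance over all unordered pairs of texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        raise ValueError("need at least two texts")
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


samples = ["alpha beta gamma", "alpha beta delta", "epsilon zeta eta"]
print(mean_pairwise_distance(samples))
```

Values near 1 mean the outputs share almost no vocabulary. An embedding-based distance would capture semantic rather than lexical diversity, which is why the metric's definition matters for interpreting the claim.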
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments, which highlight opportunities to strengthen the presentation and evaluation of our work. We address each major comment below and outline the revisions we will make.
- Referee: Abstract: the headline ASR figures (85.5% on Claude-Sonnet-4.5, 95.9% average) are presented without any description of the baseline methods, evaluation protocol, statistical tests, or controls for model-specific effects, so the magnitude and reliability of the claimed gains cannot be assessed from the given text.
  Authors: We agree that the abstract is too high-level to convey the experimental context. The full manuscript details the baselines (including prompt- and code-based methods from prior work), the evaluation protocol (standard harmful query sets, ASR metric, multiple target models with controls for model-specific behaviors), and comparative analyses. To improve accessibility, we will revise the abstract to briefly note the comparison to existing methods and the multi-model evaluation setting.
  Revision: yes
- Referee: Abstract / Experimental Results: the description supplies no indication of a held-out query partition or cross-model zero-shot transfer test for the final synthesized code. Without such safeguards, the reported success rates could arise from query-specific repair logic rather than generalizable attack procedures, directly undermining the central claim that the evolutionary synthesis produces broadly useful algorithms.
  Authors: This is a fair critique of the generalizability evidence. While the self-correction loop and procedural code design aim to produce reusable algorithms, and experiments cover diverse models and queries, the manuscript does not explicitly report held-out query partitions or zero-shot cross-model transfer results for the evolved code. We will add these analyses in the revision to directly support the claim of broadly useful algorithms.
  Revision: yes
Circularity Check
Empirical results from evolutionary code synthesis framework show no circular derivation
The paper introduces EvoSynth as a multi-agent evolutionary framework for synthesizing executable jailbreak code and reports experimental Attack Success Rates (e.g., 85.5% on Claude-Sonnet-4.5) obtained through direct testing. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps are present; the central claims rest on empirical evaluation rather than any reduction of outputs to inputs by construction. The work is self-contained against external benchmarks via reported ASR metrics and diversity comparisons.
Axiom & Free-Parameter Ledger
invented entities (1)
- EvoSynth multi-agent code-evolution framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute code-based attack algorithms... code-level self-correction loop
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat ≃ Nat recovery
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We model the sequential decision-making process of EvoSynth as a structured trajectory generation task... soft Q-learning Bellman backup
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models
  TAME uses a Mixture-of-Experts prompt bank with input-dependent routing and three unsupervised objectives to adaptively defend CLIP against adversarial attacks at inference time, achieving at least 49.1% robustness ga...