pith · machine review for the scientific record

arxiv: 2407.04295 · v2 · submitted 2024-07-05 · 💻 cs.CR · cs.AI · cs.CL · cs.LG

Recognition: 1 theorem link

· Lean Theorem

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 02:17 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL · cs.LG
keywords methods · llms · attack · jailbreak · attacks · defense · adversarial · defenses
0 comments

The pith

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can be tricked by carefully crafted prompts into producing harmful or restricted outputs, a problem called jailbreaking. This paper maps out the different attack techniques, grouping them by whether the attacker has access to the model's internal details or not. It does the same for defense strategies, separating those that modify the input prompt from those that change the model itself. The authors also examine how researchers currently test these attacks and defenses, comparing the methods from multiple angles. The overall goal is to give the community a clearer picture of the landscape so that better protections can be built. Because the field is moving quickly, the survey aims to serve as a reference point for future work on making these AI systems more reliable and less prone to misuse.
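
To make the two-axis classification concrete, here is a minimal sketch of the survey's top-level taxonomy as a plain data structure. The black-box/white-box and prompt-level/model-level split comes from the abstract; the sub-class labels below are illustrative placeholders (drawn from method families covered in the reference list, such as perplexity filtering and safety fine-tuning), not the paper's exact leaf categories.

from dataclasses import dataclass, field

@dataclass
class Category:
    name: str                   # top-level class from the survey's taxonomy
    criterion: str              # what places a method in this class
    subclasses: list[str] = field(default_factory=list)  # illustrative examples only

JAILBREAK_TAXONOMY = {
    "attacks": [
        Category("white-box", "attacker can see weights, gradients, or logits",
                 ["gradient-based prompt optimization", "decoding manipulation", "fine-tuning attacks"]),
        Category("black-box", "attacker can only query the deployed model",
                 ["manually crafted prompts", "automated prompt search", "encoded or indirect prompts"]),
    ],
    "defenses": [
        Category("prompt-level", "filter or transform the prompt and response text",
                 ["perplexity filtering", "prompt rewriting", "jailbreak detection"]),
        Category("model-level", "change the model or its training",
                 ["safety fine-tuning", "preference optimization (RLHF/DPO)", "decoding-time guardrails"]),
    ],
}

if __name__ == "__main__":
    # Print the taxonomy, one class per line.
    for side, categories in JAILBREAK_TAXONOMY.items():
        for c in categories:
            print(f"{side:8s} {c.name:12s} {c.criterion} -> {', '.join(c.subclasses)}")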

Core claim

we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods... and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives.

Load-bearing premise

That the proposed taxonomy and sub-classifications accurately and comprehensively capture the current landscape of attacks and defenses without significant omissions or overlaps that would require revision.

read the original abstract

Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of "jailbreaking", which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor’s note, referee report, simulated authors’ rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the work introduces no free parameters, axioms, or invented entities; it relies entirely on summarizing and classifying prior published research.

pith-pipeline@v0.9.0 · 5540 in / 1003 out tokens · 45018 ms · 2026-05-15T02:17:07.003351+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
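
For readers who want to consume these theorem links programmatically, a minimal sketch of the tag vocabulary as an enum follows; the class name and comments are illustrative assumptions, not an actual Pith API.

from enum import Enum

class TheoremLinkTag(Enum):
    # Hypothetical enum mirroring the tag definitions above; not a real Pith interface.
    MATCHES = "matches"          # claim directly supported by a canon theorem
    SUPPORTS = "supports"        # theorem backs part of the argument; paper may add assumptions
    EXTENDS = "extends"          # theorem is a base layer; the paper goes further
    USES = "uses"                # paper relies on the theorem as machinery
    CONTRADICTS = "contradicts"  # claim conflicts with a theorem or certificate in the canon
    UNCLEAR = "unclear"          # possible connection, too broad or ambiguous to confirm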

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  2. SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

    cs.CR 2026-05 unverdicted novelty 7.0

    SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across mul...

  3. RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...

  4. Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

    cs.CR 2026-04 unverdicted novelty 7.0

    A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

  5. MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

    cs.CR 2026-05 unverdicted novelty 6.0

    MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

  6. TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

  7. Training a General Purpose Automated Red Teaming Model

    cs.CR 2026-04 unverdicted novelty 6.0

    A pipeline trains general-purpose red teaming models by finetuning small LLMs like Qwen3-8B to generate attacks for both seen and unseen adversarial objectives without relying on existing evaluators.

  8. How Adversarial Environments Mislead Agentic AI?

    cs.AI 2026-04 unverdicted novelty 6.0

    Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.

  9. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  10. LLM Safety From Within: Detecting Harmful Content with Internal Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

  11. TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

    cs.CR 2026-04 unverdicted novelty 6.0

    TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.

  12. The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

    cs.CR 2026-04 unverdicted novelty 6.0

    Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

  13. Compression as an Adversarial Amplifier Through Decision Space Reduction

    cs.CV 2026-04 unverdicted novelty 6.0

    Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same pertu...

  14. Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

    cs.CR 2026-03 unverdicted novelty 6.0

    Comic-based visual narratives achieve over 90% ensemble success rates on multiple MLLMs, outperforming text and random-image baselines while breaking existing safety methods and evaluators.

  15. Insider Attacks in Multi-Agent LLM Consensus Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.

  16. SoK: Robustness in Large Language Models against Jailbreak Attacks

    cs.CR 2026-05 accept novelty 5.0

    The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

  17. AgentDID: Trustless Identity Authentication for AI Agents

    cs.CR 2026-04 unverdicted novelty 5.0

    AgentDID is a W3C-compliant decentralized identity system for AI agents enabling self-managed authentication and state verification via challenge-response.

  18. An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Multi-generation sampling from LLMs uncovers more jailbreak behaviors than single generations, with the largest gains from one to moderate sample counts and diminishing returns thereafter.

  19. Jailbreaking Large Language Models with Morality Attacks

    cs.CL 2026-04 unverdicted novelty 5.0

    Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.

  20. Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference

    cs.CR 2026-04 unverdicted novelty 4.0

    A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.

Reference graph

Works this paper leans on

126 extracted references · 126 canonical work pages · cited by 20 Pith papers · 16 internal anchors

  1. [1]

    Detecting language model attacks with perplexity

    Gabriel Alon and Michael Kamfonas. Detecting Language Model Attacks with Perplexity. CoRR abs/2308.14132, 2023. 12, 16

  2. [2]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. CoRR abs/2404.02151, 2024. 4

  3. [3]

    Gemini: A Family of Highly Capable Multimodal Models

    Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, ...

  4. [4]

    Introducing claude

    Anthropic. Introducing claude. https://www.anthropic.com/news/introducing-claude,

  5. [5]

    Many-shot jailbreaking

    Anthropic. Many-shot jailbreaking. https://www.anthropic.com/research/many-shot-jailbreaking, 2024. 4, 8

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  7. [7]

    How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries

    Somnath Banerjee, Sayan Layek, Rima Hazra, and Animesh Mukherjee. How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries. CoRR abs/2402.15302, 2024. 16, 17

  8. [8]

    Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

    Rishabh Bhardwaj and Soujanya Poria. Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. CoRR abs/2308.09662, 2023. 12, 14

  9. [9]

    Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

    Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. In International Conference on Learning Representations (ICLR), 2024. 12, 13, 14

  10. [10]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

  11. [11]

    Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

    Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM. CoRR abs/2309.14348, 2023. 12

  12. [12]

    Explore, Establish, Exploit: Red Teaming Language Models from Scratch

    Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, and Dylan Hadfield-Menell. Explore, Establish, Exploit: Red Teaming Language Models from Scratch. CoRR abs/2306.09442, 2023. 4, 11

  13. [13]

    Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues

    Zhiyuan Chang, Mingyang Li, Yi Liu, Junjie Wang, Qing Wang, and Yang Liu. Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues. CoRR abs/2402.09091, 2024. 4, 9

  14. [14]

    JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. CoRR abs/2404.01318, 2024. 16, 17

  15. [15]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries. CoRR abs/2310.08419, 2023. 4, 11

  16. [16]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad B...

  17. [17]

    Comprehensive Assessment of Jailbreak Attacks Against LLMs

    Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. Comprehensive Assessment of Jailbreak Attacks Against LLMs. CoRR abs/2402.05668, 2024. 2

  18. [18]

    Attack Prompt Generation for Red Teaming and Defending Large Language Models

    Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, and Xiangnan He. Attack Prompt Generation for Red Teaming and Defending Large Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2176–2189. ACL, 2023. 12, 13

  19. [19]

    Jailbreaker: Automated jailbreak across multiple large language model chatbots

    Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots. CoRR abs/2307.08715, 2023. 4, 10

  20. [20]

    Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning

    Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, and Yang Liu. Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning. CoRR abs/2402.08416, 2024. 4, 7

  21. [21]

    Multilingual Jailbreak Challenges in Large Language Models

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual Jailbreak Challenges in Large Language Models. In International Conference on Learning Representations (ICLR), 2024. 4, 9

  22. [22]

    A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

    Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily. CoRR abs/2311.08268, 2023. 4, 7

  23. [23]

    Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak

    Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, and Bing Qin. Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak. CoRR abs/2312.04127, 2023. 4, 5

  24. [24]

    A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models

    Aysan Esmradi, Daniel Wankit Yip, and Chun-Fai Chan. A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models. CoRR abs/2312.10982,

  25. [25]

    Configurable Safety Tuning of Language Models with Synthetic Preference Data

    Víctor Gallego. Configurable Safety Tuning of Language Models with Synthetic Preference Data. CoRR abs/2404.00495, 2024. 12, 14

  26. [26]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Jo...

  27. [27]

    MART: improving LLM safety with multi-round automatic red-teaming

    Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. MART: improving LLM safety with multi-round automatic red-teaming. CoRR abs/2311.07689,

  28. [28]

    Coercing LLMs to do and reveal (almost) anything

    Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing LLMs to do and reveal (almost) anything. CoRR abs/2402.14020, 2024. 2

  29. [29]

    Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, and Stephan Günnemann. Attacking Large Language Models with Projected Gradient Descent. CoRR abs/2402.09154, 2024. 4

  30. [30]

    FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts

    Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts. CoRR abs/2311.05608, 2023. 16, 17

  31. [31]

    COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

    Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, and Bin Hu. COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability. CoRR abs/2402.08679, 2024. 4, 5

  32. [32]

    From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy

    Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. CoRR abs/2307.00691, 2023. 1, 2

  33. [33]

    Jailbreaking Proprietary Large Language Models using Word Substitution Cipher

    Divij Handa, Advait Chirmule, Bimal G. Gajera, and Chitta Baral. Jailbreaking Proprietary Large Language Models using Word Substitution Cipher. CoRR abs/2402.10601, 2024. 4, 9

  34. [34]

    Query-based adversarial prompt generation

    Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, and Milad Nasr. Query-Based Adversarial Prompt Generation. CoRR abs/2402.12329,

  35. [35]

    Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

    Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes. CoRR abs/2403.00867, 2024. 12, 14

  36. [36]

    Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

    Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. In International Conference on Learning Representations (ICLR), 2024. 4, 5

  37. [37]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. CoRR abs/2309.00614, 2023. 12

  38. [38]

    Defending large language models against jailbreak attacks via semantic smoothing

    Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, and Shiyu Chang. Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing. CoRR abs/2402.16192, 2024. 12

  39. [39]

    Beavertails: Towards improved safety alignment of LLM via a human-preference dataset

    Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 14

  40. [40]

    ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

    Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. CoRR abs/2402.11753, 2024. 4, 9

  41. [41]

    Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models

    Haibo Jin, Ruoxi Chen, Andy Zhou, Jinyin Chen, Yang Zhang, and Haohan Wang. GUARD: role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. CoRR abs/2402.03299, 2024. 4, 11

  42. [42]

    Automatically Auditing Large Language Models via Discrete Optimization

    Erik Jones, Anca D. Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically Auditing Large Language Models via Discrete Optimization. In International Conference on Machine Learning (ICML), pages 15307–15329. PMLR, 2023. 3, 4, 5

  43. [43]

    Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks

    Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733, 2023. 4, 8

  44. [44]

    Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

    Heegyu Kim, Sehyun Yuk, and Hyunsouk Cho. Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement. CoRR abs/2402.15180, 2024. 12, 15

  45. [45]

    Certifying llm safety against adversarial prompting

    Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. Certifying LLM Safety against Adversarial Prompting. CoRR abs/2309.02705, 2023. 12

  46. [46]

    Open sesame! universal black box jailbreaking of large language models

    Raz Lapid, Ron Langberg, and Moshe Sipper. Open Sesame! Universal Black Box Jailbreaking of Large Language Models. CoRR abs/2309.01446, 2023. 4, 10

  47. [47]

    Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b

    Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. CoRR abs/2310.20624,

  48. [48]

    Multi-step jailbreaking privacy attacks on chatgpt

    Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023. 4, 7

  49. [49]

    A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

    Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, and Yinxing Xue. A Cross-Language Investigation into Jailbreak Attacks in Large Language Models. CoRR abs/2401.16765,

  50. [50]

    Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs

    Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, and Ee-Chien Chang. Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs. CoRR abs/2402.14872,

  51. [51]

    DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

    Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers. CoRR abs/2402.16914, 2024. 9

  52. [52]

    DeepInception: Hypnotize Large Language Model to Be Jailbreaker

    Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. DeepInception: Hypnotize Large Language Model to Be Jailbreaker. CoRR abs/2311.03191, 2023. 4, 7

  53. [53]

    RAIN: your language models can align themselves without finetuning

    Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. RAIN: your language models can align themselves without finetuning. CoRR abs/2309.07124, 2023. 12, 14

  54. [54]

    Goal-Oriented Prompt Attack and Safety Evaluation for LLMs

    Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, and Fei Wu. Goal-Oriented Prompt Attack and Safety Evaluation for LLMs. CoRR abs/2309.11830, 2023. 4, 11

  55. [55]

    Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

    Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, and Kai Chen. Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction. CoRR abs/2402.18104, 2024. 4, 9

  56. [56]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. CoRR abs/2310.04451, 2023. 4, 10, 16

  57. [57]

    Jailbreaking chatgpt via prompt engineering: An empirical study

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. CoRR abs/2305.13860,

  58. [58]

    Safe and helpful chinese

    Yule Liu, Kaitian Chao Ting Lu, Yanshun Zhang, and Yingliang Zhang. Safe and helpful chinese. https://huggingface.co/datasets/DirectLLM/Safe_and_Helpful_Chinese, 2023. 12, 14

  59. [59]

    Enhancing LLM safety via constrained direct preference optimization

    Zixuan Liu, Xiaolin Sun, and Zizhan Zheng. Enhancing LLM safety via constrained direct preference optimization. CoRR abs/2403.02475, 2024. 12, 14

  60. [60]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. CoRR abs/2308.08747, 2023. 14

  61. [61]

    CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

    Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, and Xuanjing Huang. CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models. CoRR abs/2402.16717, 2024. 4, 8

  62. [62]

    PRP: propagating universal perturbations to attack large language model guard-rails

    Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, and Atul Prakash. PRP: propagating universal perturbations to attack large language model guard-rails. CoRR abs/2402.15911, 2024. 4, 5

  63. [63]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, and Dan Hendrycks. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. CoRR abs/2402.04249, 2024. 16, 17

  64. [64]

    Tree of attacks: Jailbreaking black-box llms automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. CoRR abs/2312.02119,

  65. [65]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. CoRR abs/2303.08774, 2023. 14

  66. [66]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...

  67. [67]

    AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

    Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs. CoRR abs/2404.16873, 2024. 16, 17

  68. [68]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! CoRR abs/2310.03693, 2023. 4, 6, 14, 15

  69. [69]

    Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

    Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, and Zhenzhong Lan. Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models. CoRR abs/2307.08487,

  70. [70]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2023. 14

  71. [71]

    JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models

    Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, and Anyu Wang. JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models. CoRR abs/2406.09321, 2024. 15, 16

  72. [72]

    Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks

    Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, and Monojit Choudhury. Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks. CoRR abs/2305.14965, 2023. 2

  73. [73]

    Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. CoRR abs/2310.03684, 2023. 12

  74. [74]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. CoRR abs/2308.01263, 2023. 16, 17

  75. [75]

    Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition

    Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, and Jordan L. Boyd-Graber. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. In Proceedings of the 2023 Conference on Empirical M...

  76. [76]

    Scalable and transferable black-box jailbreaks for language models via persona modulation

    Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. CoRR abs/2311.03348, 2023. 4, 11

  77. [77]

    SPML: A DSL for defending language models against prompt attacks

    Reshabh K. Sharma, Vinayak Gupta, and Dan Grossman. SPML: A DSL for defending language models against prompt attacks. CoRR abs/2402.11755, 2024. 12, 13

  78. [78]

    Survey of vulnerabilities in large language models revealed by adversarial attacks

    Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael B. Abu-Ghazaleh. Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks. CoRR abs/2310.10844, 2023. 2

  79. [79]

    Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. CoRR abs/2308.03825,

  80. [80]

    AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

    Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, and Yongfeng Zhang. AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models. CoRR abs/2401.09002, 2024. 16, 17

Showing first 80 references.