pith · machine review for the scientific record

arxiv: 2407.04295 · v2 · submitted 2024-07-05 · 💻 cs.CR · cs.AI · cs.CL · cs.LG

Recognition: 1 theorem link

· Lean Theorem

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 02:17 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL · cs.LG
keywords methods · llms · attack · jailbreak · attacks · defense · adversarial · defenses
0 comments

The pith

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can be tricked by carefully crafted prompts into producing harmful or restricted outputs, a problem called jailbreaking. This paper maps out the different attack techniques, grouping them by whether the attacker has access to the model's internal details or not. It does the same for defense strategies, separating those that modify the input prompt from those that change the model itself. The authors also examine how researchers currently test these attacks and defenses, comparing the methods from multiple angles. The overall goal is to give the community a clearer picture of the landscape so that better protections can be built. Because the field is moving quickly, the survey aims to serve as a reference point for future work on making these AI systems more reliable and less prone to misuse.
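
To make the two-axis classification concrete, here is a minimal sketch of the survey's top-level taxonomy as a plain data structure. The black-box/white-box and prompt-level/model-level split comes from the abstract; the sub-class labels below are illustrative placeholders (drawn from method families covered in the reference list, such as perplexity filtering and safety fine-tuning), not the paper's exact leaf categories.

from dataclasses import dataclass, field

@dataclass
class Category:
    name: str                   # top-level class from the survey's taxonomy
    criterion: str              # what places a method in this class
    subclasses: list[str] = field(default_factory=list)  # illustrative examples only

JAILBREAK_TAXONOMY = {
    "attacks": [
        Category("white-box", "attacker can see weights, gradients, or logits",
                 ["gradient-based prompt optimization", "decoding manipulation", "fine-tuning attacks"]),
        Category("black-box", "attacker can only query the deployed model",
                 ["manually crafted prompts", "automated prompt search", "encoded or indirect prompts"]),
    ],
    "defenses": [
        Category("prompt-level", "filter or transform the prompt and response text",
                 ["perplexity filtering", "prompt rewriting", "jailbreak detection"]),
        Category("model-level", "change the model or its training",
                 ["safety fine-tuning", "preference optimization (RLHF/DPO)", "decoding-time guardrails"]),
    ],
}

if __name__ == "__main__":
    # Print the taxonomy, one class per line.
    for side, categories in JAILBREAK_TAXONOMY.items():
        for c in categories:
            print(f"{side:8s} {c.name:12s} {c.criterion} -> {', '.join(c.subclasses)}")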

Core claim

we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods... and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives.

Load-bearing premise

That the proposed taxonomy and sub-classifications accurately and comprehensively capture the current landscape of attacks and defenses without significant omissions or overlaps that would require revision.

read the original abstract

Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of "jailbreaking", which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor’s note, referee report, simulated authors’ rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the work introduces no free parameters, axioms, or invented entities; it relies entirely on summarizing and classifying prior published research.

pith-pipeline@v0.9.0 · 5540 in / 1003 out tokens · 45018 ms · 2026-05-15T02:17:07.003351+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
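
For readers who want to consume these theorem links programmatically, a minimal sketch of the tag vocabulary as an enum follows; the class name and comments are illustrative assumptions, not an actual Pith API.

from enum import Enum

class TheoremLinkTag(Enum):
    # Hypothetical enum mirroring the tag definitions above; not a real Pith interface.
    MATCHES = "matches"          # claim directly supported by a canon theorem
    SUPPORTS = "supports"        # theorem backs part of the argument; paper may add assumptions
    EXTENDS = "extends"          # theorem is a base layer; the paper goes further
    USES = "uses"                # paper relies on the theorem as machinery
    CONTRADICTS = "contradicts"  # claim conflicts with a theorem or certificate in the canon
    UNCLEAR = "unclear"          # possible connection, too broad or ambiguous to confirm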

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  2. SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

    cs.CR 2026-05 unverdicted novelty 7.0

    SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across mul...

  3. RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...

  4. Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

    cs.CR 2026-04 unverdicted novelty 7.0

    A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

  5. MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

    cs.CR 2026-05 unverdicted novelty 6.0

    MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.

  6. TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

  7. Training a General Purpose Automated Red Teaming Model

    cs.CR 2026-04 unverdicted novelty 6.0

    A pipeline trains general-purpose red teaming models by finetuning small LLMs like Qwen3-8B to generate attacks for both seen and unseen adversarial objectives without relying on existing evaluators.

  8. How Adversarial Environments Mislead Agentic AI?

    cs.AI 2026-04 unverdicted novelty 6.0

    Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.

  9. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  10. LLM Safety From Within: Detecting Harmful Content with Internal Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

  11. TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

    cs.CR 2026-04 unverdicted novelty 6.0

    TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.

  12. The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

    cs.CR 2026-04 unverdicted novelty 6.0

    Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

  13. Compression as an Adversarial Amplifier Through Decision Space Reduction

    cs.CV 2026-04 unverdicted novelty 6.0

    Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same pertu...

  14. Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

    cs.CR 2026-03 unverdicted novelty 6.0

    Comic-based visual narratives achieve over 90% ensemble success rates on multiple MLLMs, outperforming text and random-image baselines while breaking existing safety methods and evaluators.

  15. Insider Attacks in Multi-Agent LLM Consensus Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.

  16. SoK: Robustness in Large Language Models against Jailbreak Attacks

    cs.CR 2026-05 accept novelty 5.0

    The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

  17. AgentDID: Trustless Identity Authentication for AI Agents

    cs.CR 2026-04 unverdicted novelty 5.0

    AgentDID is a W3C-compliant decentralized identity system for AI agents enabling self-managed authentication and state verification via challenge-response.

  18. An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Multi-generation sampling from LLMs uncovers more jailbreak behaviors than single generations, with the largest gains from one to moderate sample counts and diminishing returns thereafter.

  19. Jailbreaking Large Language Models with Morality Attacks

    cs.CL 2026-04 unverdicted novelty 5.0

    Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.

  20. Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference

    cs.CR 2026-04 unverdicted novelty 4.0

    A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.

Reference graph

Works this paper leans on

126 extracted references · 126 canonical work pages · cited by 20 Pith papers · 16 internal anchors

  1. [1]

    Detecting language model attacks with perplexity

    Gabriel Alon and Michael Kamfonas. Detecting Language Model Attacks with Perplexity. CoRR abs/2308.14132, 2023. 12, 16

  2. [2]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. CoRR abs/2404.02151, 2024. 4

  3. [3]

    Gemini: A Family of Highly Capable Multimodal Models

    Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, ...

  4. [4]

    Introducing claude

    Anthropic. Introducing claude. https://www.anthropic.com/news/introducing-claude,

  5. [5]

    Many-shot jailbreaking

    Anthropic. Many-shot jailbreaking. https://www.anthropic.com/research/many-shot-jailbreaking, 2024. 4, 8

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  7. [7]

    How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries

    Somnath Banerjee, Sayan Layek, Rima Hazra, and Animesh Mukherjee. How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries. CoRR abs/2402.15302, 2024. 16, 17

  8. [8]

    Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

    Rishabh Bhardwaj and Soujanya Poria. Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. CoRR abs/2308.09662, 2023. 12, 14

  9. [9]

    Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

    Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. In International Conference on Learning Representations (ICLR), 2024. 12, 13, 14

  10. [10]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

  11. [11]

    Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

    Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM. CoRR abs/2309.14348, 2023. 12

  12. [12]

    Explore, Establish, Exploit: Red Teaming Language Models from Scratch

    Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, and Dylan Hadfield-Menell. Explore, Establish, Exploit: Red Teaming Language Models from Scratch. CoRR abs/2306.09442, 2023. 4, 11

  13. [13]

    Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues

    Zhiyuan Chang, Mingyang Li, Yi Liu, Junjie Wang, Qing Wang, and Yang Liu. Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues. CoRR abs/2402.09091, 2024. 4, 9

  14. [14]

    JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. CoRR abs/2404.01318, 2024. 16, 17

  15. [15]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries. CoRR abs/2310.08419, 2023. 4, 11

  16. [16]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad B...

  17. [17]

    Comprehensive Assessment of Jailbreak Attacks Against LLMs

    Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. Comprehensive Assessment of Jailbreak Attacks Against LLMs. CoRR abs/2402.05668, 2024. 2

  18. [18]

    Attack Prompt Generation for Red Teaming and Defending Large Language Models

    Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, and Xiangnan He. Attack Prompt Generation for Red Teaming and Defending Large Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2176–2189. ACL, 2023. 12, 13

  19. [19]

    Jailbreaker: Automated jailbreak across multiple large language model chatbots

    Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots. CoRR abs/2307.08715, 2023. 4, 10

  20. [20]

    Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning

    Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, and Yang Liu. Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning. CoRR abs/2402.08416, 2024. 4, 7

  21. [21]

    Multilingual Jailbreak Challenges in Large Language Models

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual Jailbreak Challenges in Large Language Models. In International Conference on Learning Representations (ICLR), 2024. 4, 9

  22. [22]

    A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

    Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily. CoRR abs/2311.08268, 2023. 4, 7

  23. [23]

    Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak

    Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, and Bing Qin. Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak. CoRR abs/2312.04127, 2023. 4, 5

  24. [24]

    A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models

    Aysan Esmradi, Daniel Wankit Yip, and Chun-Fai Chan. A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models. CoRR abs/2312.10982,

  25. [25]

    Configurable Safety Tuning of Language Models with Synthetic Preference Data

    Víctor Gallego. Configurable Safety Tuning of Language Models with Synthetic Preference Data. CoRR abs/2404.00495, 2024. 12, 14

  26. [26]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Jo...

  27. [27]

    MART: improving LLM safety with multi-round automatic red-teaming

    Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. MART: improving LLM safety with multi-round automatic red-teaming. CoRR abs/2311.07689,

  28. [28]

    Coercing LLMs to do and reveal (almost) anything

    Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing LLMs to do and reveal (almost) anything. CoRR abs/2402.14020, 2024. 2

  29. [29]

    Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, and Stephan Günnemann. Attacking Large Language Models with Projected Gradient Descent. CoRR abs/2402.09154, 2024. 4

  30. [30]

    FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts

    Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts. CoRR abs/2311.05608, 2023. 16, 17

  31. [31]

    COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

    Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, and Bin Hu. COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability. CoRR abs/2402.08679, 2024. 4, 5

  32. [32]

    From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy

    Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. CoRR abs/2307.00691, 2023. 1, 2

  33. [33]

    Jailbreaking Proprietary Large Language Models using Word Substitution Cipher

    Divij Handa, Advait Chirmule, Bimal G. Gajera, and Chitta Baral. Jailbreaking Proprietary Large Language Models using Word Substitution Cipher. CoRR abs/2402.10601, 2024. 4, 9

  34. [34]

    Query-based adversarial prompt generation

    Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, and Milad Nasr. Query-Based Adversarial Prompt Generation. CoRR abs/2402.12329,

  35. [35]

    Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

    Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes. CoRR abs/2403.00867, 2024. 12, 14

  36. [36]

    Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

    Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. In International Conference on Learning Representations (ICLR), 2024. 4, 5

  37. [37]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. CoRR abs/2309.00614, 2023. 12

  38. [38]

    Defending large language models against jailbreak attacks via semantic smoothing

    Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, and Shiyu Chang. Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing. CoRR abs/2402.16192, 2024. 12

  39. [39]

    Beavertails: Towards improved safety alignment of LLM via a human-preference dataset

    Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 14

  40. [40]

    ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

    Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. CoRR abs/2402.11753, 2024. 4, 9

  41. [41]

    Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models

    Haibo Jin, Ruoxi Chen, Andy Zhou, Jinyin Chen, Yang Zhang, and Haohan Wang. GUARD: role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. CoRR abs/2402.03299, 2024. 4, 11

  42. [42]

    Automatically Auditing Large Language Models via Discrete Optimization

    Erik Jones, Anca D. Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically Auditing Large Language Models via Discrete Optimization. In International Conference on Machine Learning (ICML), pages 15307–15329. PMLR, 2023. 3, 4, 5

  43. [43]

    Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks

    Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733, 2023. 4, 8

  44. [44]

    Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

    Heegyu Kim, Sehyun Yuk, and Hyunsouk Cho. Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement. CoRR abs/2402.15180, 2024. 12, 15

  45. [45]

    Certifying llm safety against adversarial prompting

    Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. Certifying LLM Safety against Adversarial Prompting. CoRR abs/2309.02705, 2023. 12

  46. [46]

    Open sesame! universal black box jailbreaking of large language models

    Raz Lapid, Ron Langberg, and Moshe Sipper. Open Sesame! Universal Black Box Jailbreaking of Large Language Models. CoRR abs/2309.01446, 2023. 4, 10

  47. [47]

    Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b

    Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. CoRR abs/2310.20624,

  48. [48]

    Multi-step jailbreaking privacy attacks on chatgpt

    Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023. 4, 7

  49. [49]

    A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

    Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, and Yinxing Xue. A Cross-Language Investigation into Jailbreak Attacks in Large Language Models. CoRR abs/2401.16765,

  50. [50]

    Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs

    Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, and Ee-Chien Chang. Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs. CoRR abs/2402.14872,

  51. [51]

    DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

    Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers. CoRR abs/2402.16914, 2024. 9

  52. [52]

    DeepInception: Hypnotize Large Language Model to Be Jailbreaker

    Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. DeepInception: Hypnotize Large Language Model to Be Jailbreaker. CoRR abs/2311.03191, 2023. 4, 7

  53. [53]

    RAIN: your language models can align themselves without finetuning

    Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. RAIN: your language models can align themselves without finetuning. CoRR abs/2309.07124, 2023. 12, 14

  54. [54]

    Goal-Oriented Prompt Attack and Safety Evaluation for LLMs

    Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, and Fei Wu. Goal-Oriented Prompt Attack and Safety Evaluation for LLMs. CoRR abs/2309.11830, 2023. 4, 11

  55. [55]

    Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

    Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, and Kai Chen. Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction. CoRR abs/2402.18104, 2024. 4, 9

  56. [56]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. CoRR abs/2310.04451, 2023. 4, 10, 16

  57. [57]

    Jailbreaking chatgpt via prompt engineering: An empirical study

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. CoRR abs/2305.13860,

  58. [58]

    Safe and helpful chinese

    Yule Liu, Kaitian Chao Ting Lu, Yanshun Zhang, and Yingliang Zhang. Safe and helpful chinese. https://huggingface.co/datasets/DirectLLM/Safe_and_Helpful_Chinese, 2023. 12, 14

  59. [59]

    Enhancing LLM safety via constrained direct preference optimization

    Zixuan Liu, Xiaolin Sun, and Zizhan Zheng. Enhancing LLM safety via constrained direct preference optimization. CoRR abs/2403.02475, 2024. 12, 14

  60. [60]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. CoRR abs/2308.08747, 2023. 14

  61. [61]

    CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

    Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, and Xuanjing Huang. CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models. CoRR abs/2402.16717, 2024. 4, 8

  62. [62]

    PRP: propagating universal perturbations to attack large language model guard-rails

    Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, and Atul Prakash. PRP: propagating universal perturbations to attack large language model guard-rails. CoRR abs/2402.15911, 2024. 4, 5

  63. [63]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, and Dan Hendrycks. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. CoRR abs/2402.04249, 2024. 16, 17

  64. [64]

    Tree of attacks: Jailbreaking black-box llms automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. CoRR abs/2312.02119,

  65. [65]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. CoRR abs/2303.08774, 2023. 14

  66. [66]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...

  67. [67]

    AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

    Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs. CoRR abs/2404.16873, 2024. 16, 17

  68. [68]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! CoRR abs/2310.03693, 2023. 4, 6, 14, 15

  69. [69]

    Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

    Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, and Zhenzhong Lan. Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models. CoRR abs/2307.08487,

  70. [70]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2023. 14

  71. [71]

    JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models

    Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, and Anyu Wang. JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models. CoRR abs/2406.09321, 2024. 15, 16

  72. [72]

    Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks

    Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, and Monojit Choudhury. Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks. CoRR abs/2305.14965, 2023. 2

  73. [73]

    Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. CoRR abs/2310.03684, 2023. 12

  74. [74]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. CoRR abs/2308.01263, 2023. 16, 17

  75. [75]

    Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition

    Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, and Jordan L. Boyd-Graber. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. In Proceedings of the 2023 Conference on Empirical M...

  76. [76]

    Scalable and transferable black-box jailbreaks for language models via persona modulation

    Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. CoRR abs/2311.03348, 2023. 4, 11

  77. [77]

    SPML: A DSL for defending language models against prompt attacks

    Reshabh K. Sharma, Vinayak Gupta, and Dan Grossman. SPML: A DSL for defending language models against prompt attacks. CoRR abs/2402.11755, 2024. 12, 13

  78. [78]

    Survey of vulnerabilities in large language models revealed by adversarial attacks

    Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael B. Abu-Ghazaleh. Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks. CoRR abs/2310.10844, 2023. 2

  79. [79]

    Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. CoRR abs/2308.03825,

  80. [80]

    AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

    Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, and Yongfeng Zhang. AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models. CoRR abs/2401.09002, 2024. 16, 17

Showing first 80 references.