Recognition: 1 Lean theorem link
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Pith reviewed 2026-05-15 02:17 UTC · model grok-4.3
The pith
A survey that builds a detailed taxonomy of jailbreak attacks and defenses for LLMs, subdivides each category into sub-classes, and compares current evaluation approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods... and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives.
Load-bearing premise
That the proposed taxonomy and sub-classifications accurately and comprehensively capture the current landscape of attacks and defenses without significant omissions or overlaps that would require revision.
read the original abstract
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of "jailbreaking", which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.
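The abstract's two-axis scheme (attacks split by model transparency, defenses split by where the mitigation sits) can be pictured as a small nested mapping. This is a hedged sketch: the category names come straight from the abstract, but the example methods filed under each cell are illustrative placements drawn from works in the reference graph below, not the paper's own diagram.

```python
# Hypothetical sketch of the survey's two-axis taxonomy. Category names are
# taken from the abstract; the method names in each cell are illustrative
# placements, not a reproduction of the paper's figure.
taxonomy = {
    "attacks": {  # split by transparency of the target model
        "white-box": ["gradient-based adversarial suffixes",
                      "generation-parameter exploits"],
        "black-box": ["iterative query-based prompt rewriting",
                      "persona modulation", "multilingual prompts"],
    },
    "defenses": {  # split by where the mitigation is applied
        "prompt-level": ["perplexity filtering",
                         "input smoothing / paraphrasing"],
        "model-level": ["safety fine-tuning (RLHF / DPO)",
                        "decoding-time self-alignment"],
    },
}

def classify(kind: str, access: str) -> list[str]:
    """Return the example methods filed under one taxonomy cell."""
    return taxonomy[kind][access]

print(classify("defenses", "prompt-level"))
```

Keeping the taxonomy as plain data makes the claimed attack/defense relationships easy to query or extend as new sub-classes appear.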
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods... attack methods are divided into black-box and white-box attacks... defense methods into prompt-level and model-level defenses.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...
-
SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across mul...
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
-
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
-
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
-
Training a General Purpose Automated Red Teaming Model
A pipeline trains general-purpose red teaming models by finetuning small LLMs like Qwen3-8B to generate attacks for both seen and unseen adversarial objectives without relying on existing evaluators.
-
How Adversarial Environments Mislead Agentic AI?
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
-
Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
-
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
-
Compression as an Adversarial Amplifier Through Decision Space Reduction
Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same pertu...
-
Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models
Comic-based visual narratives achieve over 90% ensemble success rates on multiple MLLMs, outperforming text and random-image baselines while breaking existing safety methods and evaluators.
-
Insider Attacks in Multi-Agent LLM Consensus Systems
A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.
-
SoK: Robustness in Large Language Models against Jailbreak Attacks
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
-
AgentDID: Trustless Identity Authentication for AI Agents
AgentDID is a W3C-compliant decentralized identity system for AI agents enabling self-managed authentication and state verification via challenge-response.
-
An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
Multi-generation sampling from LLMs uncovers more jailbreak behaviors than single generations, with the largest gains from one to moderate sample counts and diminishing returns thereafter.
-
Jailbreaking Large Language Models with Morality Attacks
Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.
-
Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference
A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.
Reference graph
Works this paper leans on
-
[1]
Detecting language model attacks with perplexity
Gabriel Alon and Michael Kamfonas. Detecting Language Model Attacks with Perplexity. CoRR abs/2308.14132, 2023.
-
[2]
Jailbreaking leading safety-aligned llms with simple adaptive attacks
Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. CoRR abs/2404.02151, 2024.
-
[3]
Gemini: A Family of Highly Capable Multimodal Models
Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, ... CoRR, 2023.
-
[4]
Anthropic. Introducing Claude. https://www.anthropic.com/news/introducing-claude.
-
[5]
Anthropic. Many-shot jailbreaking. https://www.anthropic.com/research/many-shot-jailbreaking, 2024.
-
[6]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ... CoRR, 2022.
-
[7]
Somnath Banerjee, Sayan Layek, Rima Hazra, and Animesh Mukherjee. How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries. CoRR abs/2402.15302, 2024.
-
[8]
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
Rishabh Bhardwaj and Soujanya Poria. Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. CoRR abs/2308.09662, 2023.
-
[9]
Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. In International Conference on Learning Representations (ICLR), 2024.
-
[10]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw... 2020.
-
[11]
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM. CoRR abs/2309.14348, 2023.
-
[12]
Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, and Dylan Hadfield-Menell. Explore, Establish, Exploit: Red Teaming Language Models from Scratch. CoRR abs/2306.09442, 2023.
-
[13]
Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
Zhiyuan Chang, Mingyang Li, Yi Liu, Junjie Wang, Qing Wang, and Yang Liu. Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues. CoRR abs/2402.09091, 2024.
-
[14]
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. CoRR abs/2404.01318, 2024.
-
[15]
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries. CoRR abs/2310.08419, 2023.
-
[16]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad B... CoRR, 2021.
-
[17]
Comprehensive Assessment of Jailbreak Attacks Against LLMs
Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. Comprehensive Assessment of Jailbreak Attacks Against LLMs. CoRR abs/2402.05668, 2024.
-
[18]
Attack Prompt Generation for Red Teaming and Defending Large Language Models
Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, and Xiangnan He. Attack Prompt Generation for Red Teaming and Defending Large Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2176–2189. ACL, 2023.
-
[19]
Jailbreaker: Automated jailbreak across multiple large language model chatbots
Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots. CoRR abs/2307.08715, 2023.
-
[20]
Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, and Yang Liu. Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning. CoRR abs/2402.08416, 2024.
-
[21]
Multilingual Jailbreak Challenges in Large Language Models
Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual Jailbreak Challenges in Large Language Models. In International Conference on Learning Representations (ICLR), 2024.
-
[22]
Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily. CoRR abs/2311.08268, 2023.
-
[23]
Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, and Bing Qin. Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak. CoRR abs/2312.04127, 2023.
-
[24]
Aysan Esmradi, Daniel Wankit Yip, and Chun-Fai Chan. A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models. CoRR abs/2312.10982.
-
[25]
Configurable Safety Tuning of Language Models with Synthetic Preference Data
Víctor Gallego. Configurable Safety Tuning of Language Models with Synthetic Preference Data. CoRR abs/2404.00495, 2024.
-
[26]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Jo... CoRR, 2022.
-
[27]
MART: improving LLM safety with multi-round automatic red-teaming
Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. MART: improving LLM safety with multi-round automatic red-teaming. CoRR abs/2311.07689.
-
[28]
Coercing LLMs to do and reveal (almost) anything
Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing LLMs to do and reveal (almost) anything. CoRR abs/2402.14020, 2024.
-
[29]
-
[30]
FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts. CoRR abs/2311.05608, 2023.
-
[31]
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, and Bin Hu. COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability. CoRR abs/2402.08679, 2024.
-
[32]
From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy
Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. CoRR abs/2307.00691, 2023.
-
[33]
Divij Handa, Advait Chirmule, Bimal G. Gajera, and Chitta Baral. Jailbreaking Proprietary Large Language Models using Word Substitution Cipher. CoRR abs/2402.10601, 2024.
-
[34]
Query-based adversarial prompt generation
Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, and Milad Nasr. Query-Based Adversarial Prompt Generation. CoRR abs/2402.12329.
-
[35]
Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes. CoRR abs/2403.00867, 2024.
-
[36]
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. In International Conference on Learning Representations (ICLR), 2024.
-
[37]
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. CoRR abs/2309.00614, 2023.
-
[38]
Defending large language models against jailbreak attacks via semantic smoothing
Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, and Shiyu Chang. Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing. CoRR abs/2402.16192, 2024.
-
[39]
Beavertails: Towards improved safety alignment of LLM via a human-preference dataset
Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
-
[40]
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. CoRR abs/2402.11753, 2024.
-
[41]
Haibo Jin, Ruoxi Chen, Andy Zhou, Jinyin Chen, Yang Zhang, and Haohan Wang. GUARD: role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. CoRR abs/2402.03299, 2024.
-
[42]
Automatically Auditing Large Language Models via Discrete Optimization
Erik Jones, Anca D. Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically Auditing Large Language Models via Discrete Optimization. In International Conference on Machine Learning (ICML), pages 15307–15329. PMLR, 2023.
-
[43]
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733, 2023.
-
[44]
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
Heegyu Kim, Sehyun Yuk, and Hyunsouk Cho. Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement. CoRR abs/2402.15180, 2024.
-
[45]
Certifying llm safety against adversarial prompting
Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. Certifying LLM Safety against Adversarial Prompting. CoRR abs/2309.02705, 2023.
-
[46]
Open sesame! universal black box jailbreaking of large language models
Raz Lapid, Ron Langberg, and Moshe Sipper. Open Sesame! Universal Black Box Jailbreaking of Large Language Models. CoRR abs/2309.01446, 2023.
-
[47]
Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b
Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. CoRR abs/2310.20624.
-
[48]
Multi-step jailbreaking privacy attacks on chatgpt
Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023.
-
[49]
A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, and Yinxing Xue. A Cross-Language Investigation into Jailbreak Attacks in Large Language Models. CoRR abs/2401.16765.
-
[50]
Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, and Ee-Chien Chang. Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs. CoRR abs/2402.14872.
-
[51]
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers. CoRR abs/2402.16914, 2024.
-
[52]
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. DeepInception: Hypnotize Large Language Model to Be Jailbreaker. CoRR abs/2311.03191, 2023.
-
[53]
RAIN: your language models can align themselves without finetuning
Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. RAIN: your language models can align themselves without finetuning. CoRR abs/2309.07124, 2023.
-
[54]
Goal-Oriented Prompt Attack and Safety Evaluation for LLMs
Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, and Fei Wu. Goal-Oriented Prompt Attack and Safety Evaluation for LLMs. CoRR abs/2309.11830, 2023.
-
[55]
Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, and Kai Chen. Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction. CoRR abs/2402.18104, 2024.
-
[56]
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. CoRR abs/2310.04451, 2023.
-
[57]
Jailbreaking ChatGPT via prompt engineering: An empirical study
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. CoRR abs/2305.13860.
-
[58]
Yule Liu, Kaitian Chao Ting Lu, Yanshun Zhang, and Yingliang Zhang. Safe and helpful chinese. https://huggingface.co/datasets/DirectLLM/Safe_and_Helpful_Chinese, 2023.
-
[59]
Enhancing LLM safety via constrained direct preference optimization
Zixuan Liu, Xiaolin Sun, and Zizhan Zheng. Enhancing LLM safety via constrained direct preference optimization. CoRR abs/2403.02475, 2024.
-
[60]
Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. CoRR abs/2308.08747, 2023.
-
[61]
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, and Xuanjing Huang. CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models. CoRR abs/2402.16717, 2024.
-
[62]
PRP: propagating universal perturbations to attack large language model guard-rails
Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, and Atul Prakash. PRP: propagating universal perturbations to attack large language model guard-rails. CoRR abs/2402.15911, 2024.
-
[63]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, and Dan Hendrycks. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. CoRR abs/2402.04249, 2024.
-
[64]
Tree of attacks: Jailbreaking black-box llms automatically
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. CoRR abs/2312.02119.
-
[65]
OpenAI. GPT-4 technical report. CoRR abs/2303.08774, 2023.
-
[66]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...
-
[67]
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs. CoRR abs/2404.16873, 2024.
-
[68]
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! CoRR abs/2310.03693, 2023.
-
[69]
Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, and Zhenzhong Lan. Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models. CoRR abs/2307.08487.
-
[70]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2023.
-
[71]
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, and Anyu Wang. JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models. CoRR abs/2406.09321, 2024.
-
[72]
Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks
Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, and Monojit Choudhury. Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks. CoRR abs/2305.14965, 2023.
-
[73]
Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. CoRR abs/2310.03684, 2023.
-
[74]
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. CoRR abs/2308.01263, 2023.
-
[75]
Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, and Jordan L. Boyd-Graber. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. In Proceedings of the 2023 Conference on Empirical M...
-
[76]
Scalable and transferable black-box jailbreaks for language models via persona modulation
Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. CoRR abs/2311.03348, 2023.
-
[77]
SPML: A DSL for defending language models against prompt attacks
Reshabh K. Sharma, Vinayak Gupta, and Dan Grossman. SPML: A DSL for defending language models against prompt attacks. CoRR abs/2402.11755, 2024.
-
[78]
Survey of vulnerabilities in large language models revealed by adversarial attacks
Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael B. Abu-Ghazaleh. Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks. CoRR abs/2310.10844, 2023.
-
[79]
Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. CoRR abs/2308.03825.
-
[80]
AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, and Yongfeng Zhang. AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models. CoRR abs/2401.09002, 2024.