Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Benji Peng; Caitlyn Heqi Yin; Hanxuan Chen; Jiacheng Shi; Keyu Chen; Lawrence K.Q. Yan; Ming Liu; Pohsun Feng; Qian Niu; Riyang Bao

arxiv: 2410.15236 · v4 · pith:EA54DF5Ynew · submitted 2024-10-20 · 💻 cs.CR · cs.AI · cs.LG

Benji Peng, Hanxuan Chen, Keyu Chen, Qian Niu, Ziqian Bi, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K.Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin, Xinyuan Song, Riyang Bao, Jiacheng Shi

keywords: language, research, review, vulnerabilities, alignment, attack, attacks, defense

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields such as healthcare, software engineering, and conversational systems. Despite these advancements in the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We roughly categorize attack approaches into prompt-based, model-based, multimodal, and multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross-modality exploits. We also review various defense mechanisms, including prompt filtering, transformation, alignment techniques, multi-agent defenses, and self-regulation, evaluating their strengths and shortcomings. We further discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges like the quantification of attack success in interactive contexts and biases in existing datasets. Identifying current research gaps, we suggest future directions for resilient alignment strategies, advanced defenses against evolving attacks, automation of jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to enhance LLM security and ensure their safe deployment.

This paper has not been read by Pith yet.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs
cs.CR 2026-05 unverdicted novelty 6.0

uGen is the first retrieval-augmented multi-agent LLM framework for generating functionally correct microarchitectural attack PoCs, reporting up to 100% success on Spectre-v1 and 80% on Prime+Probe at low cost.

Multilingual jailbreaking of LLMs using low-resource languages
cs.CL 2026-05 unverdicted novelty 5.0

Multi-turn prompts in Afrikaans, Kiswahili, isiXhosa and isiZulu achieve 52-83% harmful response rates across GPT, Claude, Gemini and others, rising further with native-speaker red-teaming, showing translation quality...

SoK: Robustness in Large Language Models against Jailbreak Attacks
cs.CR 2026-05 accept novelty 5.0

The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
cs.CL 2026-04 unverdicted novelty 5.0

Multi-generation sampling from LLMs uncovers more jailbreak behaviors than single generations, with the largest gains from one to moderate sample counts and diminishing returns thereafter.

Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference
cs.CR 2026-04 unverdicted novelty 4.0

A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.

Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts
cs.CR 2025-10 unverdicted novelty 4.0

Sentra-Guard reports 99.96% detection of adversarial LLM prompts with AUC 1.00 and ASR of 0.004% using a hybrid SBERT-FAISS and transformer classifier architecture with multilingual translation and human feedback.

Exploring Potential Prompt Injection Attacks in Federated Military LLMs and Their Mitigation
cs.LG 2025-01 unverdicted novelty 2.0

Perspective paper lists secret leakage, free-rider attacks, system disruption, and misinformation as prompt-injection risks in federated military LLMs and proposes red-team wargaming plus joint policy as mitigations.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
cs.CR 2024-09 unverdicted novelty 2.0

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.