
arxiv: 2506.17231 · v3 · pith:IXHEDSOJnew · submitted 2025-05-26 · 💻 cs.CL · cs.CR

Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

classification: cs.CL, cs.CR
keywords: adversarial, attacks, jailbreak, llms, models, distillation, efficient, prompt

Current jailbreak attacks on large language models (LLMs) predominantly rely on LLMs themselves to generate adversarial prompts, creating a critical efficiency bottleneck: each attack requires substantial computational resources and API queries, limiting scalability and practical deployment. To overcome this limitation, we propose Adversarial Prompt Distillation (APD), a novel framework that transfers jailbreaking capabilities from LLMs to small language models (SLMs) for efficient, low-resource attacks. APD integrates three key components: (1) masked adversarial knowledge pre-training via LoRA fine-tuning, (2) dynamic temperature-controlled knowledge distillation to bridge architectural gaps, and (3) reinforcement learning-based template optimization for adaptive refinement. Extensive experiments across 12 models show that APD achieves state-of-the-art attack success rates (e.g., 96.4% ASR_k on GPT-4) while dramatically improving efficiency, generating prompts 3.7x faster with 11.3x fewer parameters than teacher models. Our work establishes the first practical framework for lightweight jailbreak attacks, exposes new vulnerabilities in LLM defenses, and provides a scalable testbed for advancing AI safety research. Our code is available at: https://github.com/lxgem/Efficient_and_Stealthy_Jailbreak_Attacks_via_Adversarial_Prompt.
