A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
Mixed citations
X-teaming: Multi- turn jailbreaks and defenses with adaptive multi-agents.arXiv preprint arXiv:2504.13203, 2025
Mixed citation behavior. Most common role is background (60%).
citation-role summary
citation-polarity summary
representative citing papers
Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.
This SoK paper introduces a nine-stage taxonomy for LLM guardrail breaches in phishing, characterizes evasion and manipulation tactics, and identifies a dynamic-offense versus static-defense asymmetry.
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
citing papers explorer
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
Conjunctive Prompt Attacks in Multi-Agent LLM Systems
Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.
-
SoK: Exposing the Generation and Detection Gaps in LLM-Generated Phishing
This SoK paper introduces a nine-stage taxonomy for LLM guardrail breaches in phishing, characterizes evasion and manipulation tactics, and identifies a dynamic-offense versus static-defense asymmetry.
-
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
-
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
-
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
-
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
- SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits