GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing · 2023 · cs.AI · DOI 10.48550/arxiv.2309.10253 · arXiv 2309.10253

Canonical reference. 81% of citing Pith papers cite this work as background.

58 Pith papers citing it

Background 81% of classified citations

abstract

Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.
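The loop the abstract describes (start from seed templates, select one, mutate it, query the target, let a judgment model decide whether the child joins the pool) can be sketched in a few lines. This is only an illustrative skeleton under stated assumptions, not the authors' implementation: the `[QUESTION]` placeholder, the `Seed` class, the least-visited selection rule, the string-based mutators, and the toy target and judge are all hypothetical stand-ins chosen so the example runs without any model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

PLACEHOLDER = "[QUESTION]"  # hypothetical slot marker (assumption, not from the paper)


@dataclass
class Seed:
    template: str
    visits: int = 0     # how often this seed was chosen as a parent
    successes: int = 0  # how many judged successes it is credited with


def select_seed(pool: List[Seed], rng: random.Random) -> Seed:
    """Stand-in selection rule: pick among the least-visited seeds."""
    fewest = min(s.visits for s in pool)
    return rng.choice([s for s in pool if s.visits == fewest])


def fuzz(
    initial_templates: List[str],
    questions: List[str],
    mutators: List[Callable[[str], str]],
    target: Callable[[str], str],
    judge: Callable[[str], bool],
    budget: int,
    seed: int = 0,
) -> List[Seed]:
    rng = random.Random(seed)
    pool = [Seed(t) for t in initial_templates]
    for _ in range(budget):
        parent = select_seed(pool, rng)
        parent.visits += 1
        child = rng.choice(mutators)(parent.template)
        if PLACEHOLDER not in child:
            continue  # the mutation destroyed the question slot; discard it
        # Fill the slot with each question and let the judge score the replies.
        hits = sum(judge(target(child.replace(PLACEHOLDER, q))) for q in questions)
        if hits:
            parent.successes += 1
            pool.append(Seed(child, successes=hits))  # keep successful children
    return pool


# Toy stand-ins so the sketch runs offline (all assumptions).
def toy_target(prompt: str) -> str:
    # Pretends to be a model that only "complies" with long prompts.
    return "OK" if len(prompt) > 40 else "REFUSED"


def toy_judge(response: str) -> bool:
    return response == "OK"


toy_mutators = [
    lambda t: t + " Answer step by step.",  # lengthening stand-in
    lambda t: t.replace("Please ", ""),     # shortening stand-in
]

initial = ["Please answer: [QUESTION]"]
questions = ["What is 2 + 2?"]
pool = fuzz(initial, questions, toy_mutators, toy_target, toy_judge, budget=20)
print(f"pool size after fuzzing: {len(pool)}")
```

In a real setup the mutators and the judge would themselves be models and the selection rule would weigh past successes against exploration; the sketch keeps only the control flow.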

citation-role summary

background 13 · method 2 · baseline 1

citation-polarity summary

background 13 · baseline 1 · support 1 · use method 1

representative citing papers

Do Thinking Tokens Help with Safety?

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.

FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming

cs.CR · 2026-06-18 · unverdicted · novelty 7.0

FinRED creates an expert-validated benchmark and rubric for financial LLM safety that maps regulatory standards to specific threats and reduces critical false negatives in evaluation from 28 to 12.

When LLMs Invent Rust Crates: An Empirical Study of Hallucination Patterns and Mitigation

cs.SE · 2026-06-07 · unverdicted · novelty 7.0

First empirical study showing that crate hallucination in Rust LLMs occurs at consistent rates across models, insensitive to parameters; it also tests prompt-based mitigation.

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

THRD introduces a training-free multi-turn defense framework that models temporal risk accumulation to reduce jailbreak attack success rates to 0.2-4.0% on LLMs with under 1.5% utility degradation.

Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

cs.CR · 2026-05-29 · unverdicted · novelty 7.0

Persona Attack uses step-by-step memory injections to achieve up to 95% success in making LLMs ignore safety alignments, with effectiveness depending on model memory and instruction combinations.

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.

Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

cs.CR · 2026-05-09 · unverdicted · novelty 7.0

Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utility trade-off when trying to eliminate them.

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

cs.HC · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

On the Hardness of Junking LLMs

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

Adaptive Instruction Composition for Automated LLM Red-Teaming

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

Adaptive Instruction Composition uses a neural contextual bandit with RL to adaptively combine crowdsourced texts, generating more effective and diverse LLM jailbreaks than random or prior adaptive methods on HarmBench.

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

cs.SE · 2026-02-02 · unverdicted · novelty 7.0

RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

Break Me If You Can: Self-Jailbreaking of Aligned LLMs via Lexical Insertion Prompting

cs.CL · 2026-01-06 · unverdicted · novelty 7.0

SLIP enables self-jailbreaking of aligned LLMs via lexical insertion in breadth-first tree search, reaching 94.7% average ASR on AdvBench and HarmBench across eleven models with ~7.9 calls.

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

cs.CR · 2024-04-02 · conditional · novelty 7.0

Crescendo is a multi-turn escalation jailbreak that achieves high success rates on GPT-4, Gemini, Llama, and Claude by building on the model's prior responses, with an automated tool outperforming prior attacks on AdvBench.

Mitigating Taint-Style Vulnerabilities in MCP Servers via Security-Aware Tool Descriptions

cs.CR · 2026-07-08 · conditional · novelty 6.0

SPELLSMITH mitigates taint-style vulnerabilities in MCP servers by augmenting tool descriptions with security constraints and adding LLM self-reflection before tool invocation, reducing attack success rates to near zero.

Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces

cs.CR · 2026-07-01 · conditional · novelty 6.0

SMT achieves the highest attack success rate and HarmScore on commercial function-calling LLMs from five providers by using simulated moderation traces in multi-turn trajectories, outperforming baselines with near-minimal queries.

Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security

cs.CR · 2026-06-10 · unverdicted · novelty 6.0

Runtime Skill Audit introduces targeted runtime probing to detect malicious LLM agent skills, reporting 90% accuracy and resilience to self-evolving attacks on 100 skills versus static baselines.

Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

cs.CR · 2026-06-02 · unverdicted · novelty 6.0

IHO is a new black-box jailbreak attack for LLMs that is adaptive, efficient, transferable across models and behaviors, and effective even against layered defenses without modification.

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

SentGuard achieves 90.5% detection of unsafe cases within two sentences at 7.41% false positive rate by operating at sentence boundaries during LLM streaming generation.

Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

cs.CR · 2026-05-30 · conditional · novelty 6.0

Applies MAP-Elites quality-diversity optimization to evolve semantic attack strategies across dimensions like strategy type, encoding, and length, uncovering distinct vulnerability profiles in four LLMs including GPT-4o-mini and Claude 3.5 Sonnet.

Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling

cs.CR · 2026-05-23 · unverdicted · novelty 6.0

Ellipsoid Control is a white-list test-time jailbreak defense that fits an anisotropic ellipsoid from benign activations to constrain projected gradient descent updates, aiming to improve the safety-utility tradeoff over black-list RepE methods.

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

MemAudit combines counterfactual causal influence scores with memory consistency graphs to identify poisoned records in LLM agent memory, reducing MINJA attack success from 70% to 0% in QA and 83.3% to 0% in reasoning tasks.

Toward Understanding Adversarial Distillation: Why Robust Teachers Fail

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Adversarial distillation improves student robustness when teachers show high uncertainty on robustly unlearnable samples, suppressing noise memorization and allowing reliance on learnable robust signals.

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.

citing papers explorer

Showing 50 of 58 citing papers.

Do Thinking Tokens Help with Safety? cs.LG · 2026-06-23 · unverdicted · none · ref 66 · internal anchor
Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.
FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming cs.CR · 2026-06-18 · unverdicted · none · ref 44 · internal anchor
FinRED creates an expert-validated benchmark and rubric for financial LLM safety that maps regulatory standards to specific threats and reduces critical false negatives in evaluation from 28 to 12.
When LLMs Invent Rust Crates: An Empirical Study of Hallucination Patterns and Mitigation cs.SE · 2026-06-07 · unverdicted · none · ref 47 · internal anchor
First empirical study showing that crate hallucination in Rust LLMs occurs at consistent rates across models, insensitive to parameters; it also tests prompt-based mitigation.
THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models cs.CL · 2026-06-01 · unverdicted · none · ref 41 · internal anchor
THRD introduces a training-free multi-turn defense framework that models temporal risk accumulation to reduce jailbreak attack success rates to 0.2-4.0% on LLMs with under 1.5% utility degradation.
Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models cs.CR · 2026-05-29 · unverdicted · none · ref 11 · internal anchor
Persona Attack uses step-by-step memory injections to achieve up to 95% success in making LLMs ignore safety alignments, with effectiveness depending on model memory and instruction combinations.
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 37 · internal anchor
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off cs.CR · 2026-05-09 · unverdicted · none · ref 8 · internal anchor
Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utility trade-off when trying to eliminate them.
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI cs.HC · 2026-05-07 · unverdicted · none · ref 78 · 2 links · internal anchor
Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
On the Hardness of Junking LLMs cs.LG · 2026-05-06 · unverdicted · none · ref 56 · internal anchor
Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming cs.CL · 2026-05-04 · unverdicted · none · ref 35 · internal anchor
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
Adaptive Instruction Composition for Automated LLM Red-Teaming cs.CR · 2026-04-22 · unverdicted · none · ref 25 · internal anchor
Adaptive Instruction Composition uses a neural contextual bandit with RL to adaptively combine crowdsourced texts, generating more effective and diverse LLM jailbreaks than random or prior adaptive methods on HarmBench.
RACC: Representation-Aware Coverage Criteria for LLM Safety Testing cs.SE · 2026-02-02 · unverdicted · none · ref 61 · internal anchor
RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
Break Me If You Can: Self-Jailbreaking of Aligned LLMs via Lexical Insertion Prompting cs.CL · 2026-01-06 · unverdicted · none · ref 8 · internal anchor
SLIP enables self-jailbreaking of aligned LLMs via lexical insertion in breadth-first tree search, reaching 94.7% average ASR on AdvBench and HarmBench across eleven models with ~7.9 calls.
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack cs.CR · 2024-04-02 · conditional · none · ref 33 · internal anchor
Crescendo is a multi-turn escalation jailbreak that achieves high success rates on GPT-4, Gemini, Llama, and Claude by building on the model's prior responses, with an automated tool outperforming prior attacks on AdvBench.
Mitigating Taint-Style Vulnerabilities in MCP Servers via Security-Aware Tool Descriptions cs.CR · 2026-07-08 · conditional · none · ref 65 · internal anchor
SPELLSMITH mitigates taint-style vulnerabilities in MCP servers by augmenting tool descriptions with security constraints and adding LLM self-reflection before tool invocation, reducing attack success rates to near zero.
Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces cs.CR · 2026-07-01 · conditional · none · ref 36 · internal anchor
SMT achieves the highest attack success rate and HarmScore on commercial function-calling LLMs from five providers by using simulated moderation traces in multi-turn trajectories, outperforming baselines with near-minimal queries.
Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security cs.CR · 2026-06-10 · unverdicted · none · ref 15 · internal anchor
Runtime Skill Audit introduces targeted runtime probing to detect malicious LLM agent skills, reporting 90% accuracy and resilience to self-evolving attacks on 100 skills versus static baselines.
Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs cs.CR · 2026-06-02 · unverdicted · none · ref 40 · internal anchor
IHO is a new black-box jailbreak attack for LLMs that is adaptive, efficient, transferable across models and behaviors, and effective even against layered defenses without modification.
SentGuard: Sentence-Level Streaming Guardrails for Large Language Models cs.CL · 2026-06-01 · unverdicted · none · ref 40 · internal anchor
SentGuard achieves 90.5% detection of unsafe cases within two sentences at 7.41% false positive rate by operating at sentence boundaries during LLM streaming generation.
Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety cs.CR · 2026-05-30 · conditional · none · ref 10 · internal anchor
Applies MAP-Elites quality-diversity optimization to evolve semantic attack strategies across dimensions like strategy type, encoding, and length, uncovering distinct vulnerability profiles in four LLMs including GPT-4o-mini and Claude 3.5 Sonnet.
Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling cs.CR · 2026-05-23 · unverdicted · none · ref 34 · internal anchor
Ellipsoid Control is a white-list test-time jailbreak defense that fits an anisotropic ellipsoid from benign activations to constrain projected gradient descent updates, aiming to improve the safety-utility tradeoff over black-list RepE methods.
MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection cs.AI · 2026-05-22 · unverdicted · none · ref 3 · internal anchor
MemAudit combines counterfactual causal influence scores with memory consistency graphs to identify poisoned records in LLM agent memory, reducing MINJA attack success from 70% to 0% in QA and 83.3% to 0% in reasoning tasks.
Toward Understanding Adversarial Distillation: Why Robust Teachers Fail cs.LG · 2026-05-21 · unverdicted · none · ref 8 · internal anchor
Adversarial distillation improves student robustness when teachers show high uncertainty on robustly unlearnable samples, suppressing noise memorization and allowing reliance on learnable robust signals.
An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments cs.CR · 2026-05-18 · unverdicted · none · ref 30 · internal anchor
Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.
PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures cs.CL · 2026-05-15 · unverdicted · none · ref 3 · 2 links · internal anchor
PQR framework generates diverse realistic queries to elicit QA agent failures, uncovering 23-78% more unhelpful responses than prior methods in e-commerce agent tests.
Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs cs.CR · 2026-05-15 · unverdicted · none · ref 14 · internal anchor
Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.
LLM-Agnostic Semantic Representation Attack cs.CL · 2026-05-09 · unverdicted · none · ref 75 · internal anchor
SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.
The Power of Order: Fooling LLMs with Adversarial Table Permutations cs.LG · 2026-05-01 · unverdicted · none · ref 56 · 2 links · internal anchor
Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption cs.CR · 2026-04-30 · unverdicted · none · ref 33 · internal anchor
FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing cs.CL · 2026-04-21 · conditional · none · ref 7 · 2 links · internal anchor
Current LLMs are highly vulnerable to draft-based co-authoring jailbreaks; HarDBench measures this risk and a preference-optimization method reduces harmful completions without large utility loss.
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs cs.CR · 2026-04-14 · unverdicted · none · ref 60 · internal anchor
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems cs.CR · 2026-04-13 · unverdicted · none · ref 19 · internal anchor
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models cs.CR · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
ESI framework identifies architecture-specific safety-critical parameters in LLMs, enabling SET to reduce attack success rates by over 50% via 1% weight updates and SPA to limit safety loss to under 1% during instruction tuning.
Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation cs.AI · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
CRA surgically ablates refusal-inducing activation patterns in LLM hidden states during decoding to achieve strong jailbreaks on safety-aligned models.
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training cs.CR · 2026-04-09 · unverdicted · none · ref 63 · internal anchor
ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense cs.CR · 2026-04-09 · unverdicted · none · ref 42 · internal anchor
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software cs.SE · 2025-12-28 · unverdicted · none · ref 35 · internal anchor
RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs cs.CL · 2025-11-16 · unverdicted · none · ref 56 · internal anchor
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts cs.CL · 2025-10-08 · unverdicted · none · ref 46 · internal anchor
Red-Bandit adapts online to LLM failure modes by dynamically selecting among RL-trained LoRA attack-style experts via a bandit policy, reporting SOTA ASR@10 on AdvBench with lower-perplexity prompts.
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal cs.SE · 2025-08-15 · unverdicted · none · ref 23 · internal anchor
ORFuzz presents the first evolutionary testing framework for LLM over-refusal together with a new benchmark of 1,855 cases that triggers over-refusal at 63.56% average across ten models.
Exploring the Secondary Risks of Large Language Models cs.LG · 2025-06-14 · unverdicted · none · ref 51 · internal anchor
Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.
Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment cs.CR · 2024-05-20 · unverdicted · none · ref 27 · internal anchor
SSAG bypasses logit suppression in five LLMs to produce harmful responses at 95% success rate and 86% lower latency; VulMine reaches 77% attack success against defenses.
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models cs.CR · 2024-03-28 · accept · none · ref 57 · internal anchor
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.
A StrongREJECT for Empty Jailbreaks cs.LG · 2024-02-15 · conditional · none · ref 40 · internal anchor
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models cs.AI · 2026-06-23 · unverdicted · none · ref 31 · internal anchor
PHANTOM is a consolidated open-source dataset of 47,524 multimodal adversarial samples for VLMs, extending prior benchmarks across 10 high-level categories and 55 subcategories of harmful intents.
Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems cs.CR · 2026-06-18 · unverdicted · none · ref 10 · 2 links · internal anchor
Detect-and-misdirect defenses bound asymptotic attacker success rates in model-guided jailbreaks on agentic AI, unlike detect-and-block which permit near-certain success with sufficient queries.
Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance cs.LG · 2026-06-14 · unverdicted · none · ref 6 · internal anchor
GCD uses diffusion model priors to guide suffix search, achieving higher attack success rates with better semantic adherence and lower detection than GCG-style methods.
SoK: Robustness in Large Language Models against Jailbreak Attacks cs.CR · 2026-05-06 · accept · none · ref 91 · internal anchor
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
Targeted Interpretable Safety Neuron Enhancement for Multilingual Vision-Language Large Models cs.CV · 2026-04-10 · conditional · none · ref 40 · internal anchor
A neuron-targeted safety tuning method for VLLMs reduces attack success rates from ~20-30% to ~4-6% on average across ten languages while using less than 0.03% of parameters.
PIArena: A Platform for Prompt Injection Evaluation cs.CR · 2026-04-09 · unverdicted · none · ref 10 · internal anchor
PIArena provides a unified evaluation platform for prompt injection attacks and defenses, featuring a new adaptive attack that reveals major weaknesses in existing protections.
