A StrongREJECT for Empty Jailbreaks

Alexandra Souly et al · 2024 · DOI 10.52202/079017-3984

4 Pith papers cite this work. Polarity classification is still being indexed.



citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

Fanfiction subgenres from AO3 function as universal register-based jailbreaks, raising the mean attack success rate from 0.278 to 0.731 across eight aligned LLMs on HarmBench and JailbreakBench.

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

cs.CR · 2026-05-09 · unverdicted · novelty 7.0

Introduces a 114k compositional jailbreak dataset, fine-tunes generators for on-the-fly synthesis, and presents OPTIMUS, a continuous evaluator that identifies stealth-optimal regimes missed by binary attack success rates.

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

cs.CR · 2026-06-18 · unverdicted · novelty 5.0

Detect-and-misdirect defenses bound asymptotic attacker success rates in model-guided jailbreaks on agentic AI, unlike detect-and-block defenses, which permit near-certain attacker success given sufficient queries.

The Safety-Aware Denoiser for Text Diffusion Models

cs.LG · 2026-04-28

citing papers explorer


Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs cs.CL · 2026-06-03 · unverdicted · none · ref 27
The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring cs.CR · 2026-05-09 · unverdicted · none · ref 2
Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems cs.CR · 2026-06-18 · unverdicted · none · ref 19
The Safety-Aware Denoiser for Text Diffusion Models cs.LG · 2026-04-28 · unreviewed · ref 43
