A StrongREJECT for Empty Jailbreaks
4 Pith papers cite this work.
Citation summary
Years: 2026 (4)
Roles: baseline (1)
Polarities: baseline (1)
Citing papers
- Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs
Fanfiction subgenres from AO3 function as universal register-based jailbreaks, raising mean attack success rate from 0.278 to 0.731 across eight aligned LLMs on HarmBench and JailbreakBench.
- The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack success rates.
- Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems
Detect-and-misdirect defenses bound asymptotic attacker success rates in model-guided jailbreaks on agentic AI, unlike detect-and-block which permit near-certain success with sufficient queries.
- The Safety-Aware Denoiser for Text Diffusion Models