pith. machine review for the scientific record.

arxiv: 2605.09225 · v1 · submitted 2026-05-09 · 💻 cs.CR · cs.AI · cs.LG

Recognition: 2 theorem links


The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:36 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords: jailbreak · LLM · adversarial · evaluation metric · cybersecurity · prompt generation · semantic similarity · harmfulness

The pith

A continuous metric called OPTIMUS rates LLM jailbreaks by jointly scoring semantic similarity and harmfulness, revealing quality gradations that binary success rates miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a large-scale dataset of 114,000 jailbreak prompts by combining 912 strategies with 125 harmful seeds and classifies them into 14 cybersecurity categories using majority voting from six models. It fine-tunes LLMs to generate new jailbreak prompts automatically for given categories and harmful intents. The key advance is the OPTIMUS metric, a training-free continuous score that measures both how similar the jailbreak is to the original harmful request and how likely it is to cause harm, adjusted by penalties. This approach distinguishes weak, moderate, and optimal jailbreaks in ways that simple success-or-failure counts cannot, based on extensive experiments.
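The dataset arithmetic in the summary checks out exactly: composing every strategy with every seed accounts for the full prompt count.

```python
n_strategies = 912   # composing strategies (from the abstract)
n_seeds = 125        # harmful seed prompts from JailBreakV-28K
total = n_strategies * n_seeds  # 114000, the reported dataset size
```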

Core claim

The paper establishes that jailbreak effectiveness can be assessed more precisely with a continuous metric J(S,H) combining semantic similarity S to the harmful seed and harmfulness probability H through calibrated penalty functions. This metric, tested on 114,000 prompts, separates jailbreaks into weak, moderate, and optimal categories with detailed category-level insights that binary attack success rates fail to provide. Supporting this are an automated generation method using fine-tuned models that produce low-perplexity prompts and the categorized dataset enabling strategy ranking.
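The page does not spell out how S is computed, but Figure 8 names all-mpnet-base-v2 as the best-performing similarity model, which suggests the standard recipe: embed the harmful seed and the jailbreak with a sentence encoder, then take cosine similarity. The sketch below covers only the cosine step and assumes the embedding vectors are already available; the paper's exact pipeline may differ.

```python
import math

def cosine_similarity(u, v):
    # u, v: embedding vectors for the harmful seed and the jailbreak prompt,
    # e.g. from a sentence encoder such as all-mpnet-base-v2 (assumed here).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical directions -> 1.0
```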

What carries the argument

OPTIMUS, defined as the continuous metric J(S,H) that captures semantic similarity between harmful seed and jailbreak along with harmfulness probability using calibrated penalties.
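The page never reproduces Equation (3), but an appendix excerpt quoted elsewhere on this page gives (s_u, h_ℓ, α, β) = (0.8, 0.2, 10, 10) with equilibrium S* ≈ 0.5665, H* ≈ 0.4335, J_max ≈ 0.4709, and P_S ≈ P_H ≈ 0.91, and Figure 3 mentions a harmonic mean over S and H_safe = 1 − H. The sketch below is a reconstruction consistent with those quoted numbers, not the paper's verbatim formula: a harmonic mean of similarity and safety, damped by two logistic penalties.

```python
import math

def optimus_sketch(S, H, s_u=0.8, h_l=0.2, alpha=10.0, beta=10.0):
    """Hedged reconstruction of J(S, H); the functional form is an assumption.

    Harmonic mean of similarity S and safety H_safe = 1 - H, multiplied by
    two logistic penalties: P_S discounts near-verbatim rewrites (S > s_u),
    P_H discounts prompts with negligible harm (H < h_l).
    """
    h_safe = 1.0 - H
    harmonic = 2.0 * S * h_safe / (S + h_safe + 1e-12)
    p_s = 1.0 / (1.0 + math.exp(alpha * (S - s_u)))   # ~0.91 at S = 0.57
    p_h = 1.0 / (1.0 + math.exp(beta * (h_l - H)))    # ~0.91 at H = 0.43
    return harmonic * p_s * p_h

# The quoted stealth-optimal point lands near the quoted maximum:
optimus_sketch(0.5665, 0.4335)  # ~0.471
```

With these defaults the reconstruction reproduces the quoted equilibrium to three decimal places, which is why it is offered as a plausible reading of J(S,H) rather than a guess from nothing.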

Load-bearing premise

The majority vote from six models correctly categorizes the prompts into the 14 attack types, and the penalty calibrations in the OPTIMUS metric apply generally without overfitting to this particular collection of prompts.
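The six-model vote can be sketched as a plurality over per-model category predictions. The tie-breaking behavior below (first-seen label among equally common ones, via `Counter.most_common`) is an assumption; the paper's rule is not stated on this page.

```python
from collections import Counter

def majority_category(votes):
    # votes: the category predicted by each of the six classifier models.
    counts = Counter(votes)
    category, n_votes = counts.most_common(1)[0]
    return category, n_votes

majority_category(["phishing", "phishing", "malware",
                   "phishing", "privilege escalation", "phishing"])
# ("phishing", 4)
```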

What would settle it

Demonstrating that a binary attack success rate metric can separate the weak, moderate, and optimal jailbreak categories with equivalent or better resolution than OPTIMUS on the same dataset, or that OPTIMUS scores do not hold up when applied to jailbreaks generated by methods outside the 912 strategies.

Figures

Figures reproduced from arXiv: 2605.09225 by Ismail Hossain, Md Jahangir Alam, Sai Puppala, Sajedul Talukder, Syed Bahauddin Alam, Tanzim Ahad.

Figure 1: Overview of the prompt composition process combining multiple jailbreak strategies to generate contextually …
Figure 2: Overview of generating and evaluating jailbreak prompts for LLM safety, from composing prompts through …
Figure 3: 3D Landscapes of the Optimus Score under Different Hyperparameter Configurations: (a) Balanced Configuration, (b) Lenient Configuration, (c) Strict Configuration.
Figure 4: Mean Optimus Score across all (S, H) Model Pairs.
Figure 5: Score-Range Distribution (Counts) of StrongReject Evaluation Across Four Models. Each value indicates the number of prompts whose StrongReject score falls within the specified range.
Figure 6: KDE plots showing the distribution of Optimus …
Figure 7: Top jailbreak tactic frequency per attack category (stacked bars).
Figure 8: Analytical and Empirical Surfaces of the Optimus Score. The first plot from the top shows the analytical surface derived from Equation (3), illustrating how the score peaks when semantic similarity (S) is high and harmfulness probability (H) is low. The second plot from the top presents the empirical surface computed using the best-performing model pair (all-mpnet-base-v2 × deberta-large-mnli).
Figure 9: Distribution of winning votes across 14 attack categories, showing variability in model vulnerability to …
Figure 10: Distribution of Optimus scores across different cybersecurity task categories. Each histogram shows the score ranges corresponding to Weak (0.212–0.283), Moderate (0.283–0.377), and Optimal (0.377–0.471) jailbreak compositions, highlighting the number of samples in each range per task type.
Figure 11: Correlation heatmap showing semantic relationships among different attack categories, where higher values …
Figure 12: Instruction prompt used for identifying jailbreaking strategies in adversarial user prompts.
Figure 13: Instruction prompt used for LLM-based similarity and harmfulness evaluation.
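The Weak/Moderate/Optimal score ranges quoted in Figure 10's caption can be restated as a simple binning rule. Treating each boundary as left-inclusive is an assumption about the paper's binning, since the quoted ranges share endpoints.

```python
def optimus_band(score):
    # Score bands quoted in Figure 10's caption (assumed left-inclusive).
    if 0.377 <= score <= 0.471:
        return "Optimal"
    if 0.283 <= score < 0.377:
        return "Moderate"
    if 0.212 <= score < 0.283:
        return "Weak"
    return "out of range"

optimus_band(0.40)  # "Optimal"
```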
Original abstract

Jailbreak attacks -- adversarial prompts that bypass LLM alignment through purely linguistic manipulation -- pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically. This paper addresses that gap with three contributions. (1) Large-scale compositional jailbreak dataset. We construct 114,000 adversarial prompts by applying 912 composing strategies to 125 harmful seed prompts from JailBreakV-28K. Every prompt is assigned to one of 14 cybersecurity attack categories (e.g., malware, phishing, privilege escalation) via a six-model majority-vote pipeline, and each strategy is ranked by effectiveness per category, enabling principled strategy selection grounded in concrete adversarial objectives. (2) Automated jailbreak generation. We instruction-fine-tune category-aware LLMs on Moderate and Optimal subsets, producing models that synthesize fluent jailbreak prompts from a harmful seed at inference time -- no templates, no gradient search. Our generators achieve perplexity 24-39 versus 40-140 for AutoDAN and AmpleGCG, with safety-filter evasion rates of 0.29-0.51 Mal (LlamaPromptGuard-2-86M), enabling controllable, scalable red-teaming under realistic adversarial conditions. (3) OPTIMUS: a training-free jailbreak evaluator. OPTIMUS is a continuous metric J(S,H) that jointly captures semantic similarity between the harmful seed and the jailbreak (S) and harmfulness probability (H) via calibrated penalty functions. Unlike binary attack success rate (ASR), OPTIMUS requires no task-specific training, generalizes across evolving strategies, and exposes a stealth-optimal regime (S*=0.57, H*=0.43) that ASR misses. Experiments across 114,000 prompts confirm that OPTIMUS separates Weak, Moderate, and Optimal jailbreaks with category-level evidence binary evaluation cannot supply.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper constructs a 114,000-prompt jailbreak dataset by composing 912 strategies over 125 harmful seeds, assigns each prompt to one of 14 cybersecurity categories via a six-model majority-vote pipeline, fine-tunes category-aware LLMs to generate new jailbreaks, and introduces the continuous metric OPTIMUS = J(S,H) that combines semantic similarity S and harmfulness H through calibrated penalty functions. It claims this metric reveals a stealth-optimal regime (S*=0.57, H*=0.43), separates Weak/Moderate/Optimal jailbreaks at category level, and outperforms binary ASR for evaluation and strategy ranking.

Significance. If the central claims hold after validation, the work supplies useful infrastructure for systematic jailbreak research: a large, categorized, reproducible dataset; scalable template-free generators; and a training-free continuous evaluator that captures semantic-harm trade-offs missed by binary success rates. The scale (114k prompts) and the explicit exposure of a non-binary optimum are concrete strengths that could support more targeted red-teaming and metric standardization in the field.

major comments (3)
  1. [§3] §3 (Dataset Construction and Categorization): The six-model majority-vote pipeline that assigns every prompt to one of 14 categories is load-bearing for all per-category strategy rankings and for the claim that OPTIMUS supplies category-level evidence binary ASR cannot. No inter-model agreement statistics, human-validated subset, or error analysis on boundary cases are reported. If misassignment rates are non-trivial, the observed separations and rankings could be artifacts of noisy labels rather than properties of the metric or generators.
  2. [§5] §5 (OPTIMUS Metric): The penalty functions inside J(S,H) are calibrated on the same 114k-prompt dataset used for all reported experiments, and the stealth-optimal point (S*=0.57, H*=0.43) is extracted from those same experiments. This creates a circularity risk: the metric's parameters may be tuned to the particular distribution of the constructed dataset rather than generalizing across unseen strategies or models. The abstract states the functions are 'calibrated' but provides no hold-out procedure, sensitivity analysis, or cross-strategy validation.
  3. [§4] §4 (Automated Generation): The fine-tuned generators are trained on Moderate and Optimal subsets whose labels derive from the same unvalidated majority-vote categorization and from the same data used to define the stealth-optimal regime. Reported perplexity (24-39) and evasion rates (0.29-0.51) therefore inherit the same grounding issues; without an independent test set or ablation on label noise, the superiority claims over AutoDAN and AmpleGCG remain provisional.
minor comments (2)
  1. [Abstract] Abstract and §5: The ranges 'perplexity 24-39 versus 40-140' and 'evasion rates of 0.29-0.51' are given without specifying the exact models, prompt lengths, or statistical tests; adding these details would improve reproducibility.
  2. Throughout: No error bars, confidence intervals, or multiple-run statistics accompany any of the reported rates or separations, even though the dataset size would support them.
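For the perplexity ranges the referee flags, the standard definition (exponentiated mean negative log-likelihood over tokens) can be made explicit. The paper's scoring model and tokenization are unspecified on this page, so this is only the generic formula, not its evaluation pipeline.

```python
import math

def perplexity(token_logprobs):
    # token_logprobs: natural-log probabilities assigned to each token.
    # PPL = exp(-(1/N) * sum(log p_i)); lower means more fluent text.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning uniform probability 1/40 to every token has PPL 40,
# the bottom of the AutoDAN/AmpleGCG range quoted in the abstract.
perplexity([math.log(1 / 40)] * 12)  # 40.0
```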

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects of validation and generalizability. We address each major comment below and commit to revisions that will strengthen the empirical grounding of the dataset categorization, the OPTIMUS metric, and the generator evaluations.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction and Categorization): The six-model majority-vote pipeline that assigns every prompt to one of 14 categories is load-bearing for all per-category strategy rankings and for the claim that OPTIMUS supplies category-level evidence binary ASR cannot. No inter-model agreement statistics, human-validated subset, or error analysis on boundary cases are reported. If misassignment rates are non-trivial, the observed separations and rankings could be artifacts of noisy labels rather than properties of the metric or generators.

    Authors: We agree that explicit validation of the majority-vote pipeline is necessary to support the category-level claims. In the revised manuscript we will add inter-model agreement statistics (Fleiss' kappa and pairwise Cohen's kappa across the six models) and a human validation study on a stratified random subset of 1,000 prompts, with two independent cybersecurity experts providing labels. We will also report error rates on boundary cases and re-compute the per-category strategy rankings and OPTIMUS separations after excluding low-agreement prompts. These additions will quantify label noise and confirm that the reported separations are not artifacts. revision: yes
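The Fleiss' kappa the rebuttal promises can be computed directly from per-prompt vote counts. A minimal sketch, assuming a fixed six raters (the six classifier models) per prompt; this is the textbook statistic, not code from the paper.

```python
def fleiss_kappa(count_table):
    # count_table: one row per prompt, one column per category,
    # entries = number of the six models voting for that category.
    n_items = len(count_table)
    n_raters = sum(count_table[0])
    n_cats = len(count_table[0])
    # Per-category proportion of all votes cast.
    p_j = [sum(row[j] for row in count_table) / (n_items * n_raters)
           for j in range(n_cats)]
    # Per-item observed agreement among rater pairs.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in count_table]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1.0 - p_e)

# Unanimous six-model votes on every prompt give kappa = 1.0:
fleiss_kappa([[6, 0, 0], [0, 6, 0], [0, 0, 6]])  # 1.0
```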

  2. Referee: [§5] §5 (OPTIMUS Metric): The penalty functions inside J(S,H) are calibrated on the same 114k-prompt dataset used for all reported experiments, and the stealth-optimal point (S*=0.57, H*=0.43) is extracted from those same experiments. This creates a circularity risk: the metric's parameters may be tuned to the particular distribution of the constructed dataset rather than generalizing across unseen strategies or models. The abstract states the functions are 'calibrated' but provides no hold-out procedure, sensitivity analysis, or cross-strategy validation.

    Authors: The concern about circularity is valid given the calibration procedure described. We will revise the manuscript to include a hold-out validation: the 912 strategies will be partitioned into calibration and test folds; penalty functions will be re-derived on the calibration fold only, and the stealth-optimal regime together with category rankings will be evaluated on the unseen test fold. We will also add a sensitivity analysis sweeping the penalty coefficients and report the stability of the (S*, H*) point. These controls will demonstrate that OPTIMUS generalizes beyond the original dataset distribution. revision: yes
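The proposed hold-out can be sketched as a split over the 912 strategies rather than over prompts, so that every prompt built from a held-out strategy stays unseen during calibration. The 80/20 split and seeding below are illustrative assumptions, not the authors' stated protocol.

```python
import random

def strategy_folds(strategies, test_frac=0.2, seed=0):
    # Partition composing strategies (not prompts): penalty functions are
    # calibrated on one fold, and (S*, H*) is evaluated on the other.
    rng = random.Random(seed)
    shuffled = list(strategies)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

calibration, held_out = strategy_folds(range(912))
# 729 calibration strategies, 183 held out, no overlap
```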

  3. Referee: [§4] §4 (Automated Generation): The fine-tuned generators are trained on Moderate and Optimal subsets whose labels derive from the same unvalidated majority-vote categorization and from the same data used to define the stealth-optimal regime. Reported perplexity (24-39) and evasion rates (0.29-0.51) therefore inherit the same grounding issues; without an independent test set or ablation on label noise, the superiority claims over AutoDAN and AmpleGCG remain provisional.

    Authors: We acknowledge that the generator results currently share the same labeling pipeline. In revision we will introduce two safeguards: (1) an ablation that injects controlled label noise into the Moderate/Optimal subsets and measures degradation in perplexity and evasion rates, and (2) a disjoint test set of strategies held out from both fine-tuning and OPTIMUS calibration. The generators will be re-evaluated on this independent test set, and the comparisons to AutoDAN and AmpleGCG will be updated. These steps will isolate the contribution of the generators from labeling artifacts. revision: yes
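The promised label-noise ablation can be sketched as flipping each category label to a different one with a controlled probability, then re-measuring downstream metrics. The uniform-flip model below is an assumption; real vote errors are likely concentrated on semantically adjacent categories.

```python
import random

def inject_label_noise(labels, categories, rate, seed=0):
    # Flip each label to a different category with probability `rate`,
    # to measure how sensitive downstream results are to vote errors.
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < rate:
            noisy.append(rng.choice([c for c in categories if c != y]))
        else:
            noisy.append(y)
    return noisy
```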

Circularity Check

1 step flagged

OPTIMUS's calibrated penalty functions and the derived stealth-optimal regime reduce to a fit on the same 114k-prompt evaluation dataset

specific steps
  1. fitted input called prediction [Abstract]
    "OPTIMUS is a continuous metric J(S,H) that jointly captures semantic similarity between the harmful seed and the jailbreak (S) and harmfulness probability (H) via calibrated penalty functions. ... exposes a stealth-optimal regime (S*=0.57, H*=0.43) that ASR misses. Experiments across 114,000 prompts confirm that OPTIMUS separates Weak, Moderate, and Optimal jailbreaks with category-level evidence binary evaluation cannot supply."

    The penalty functions are calibrated (parameters fitted) to the 114k-prompt dataset; the specific S* and H* values and the separation into Weak/Moderate/Optimal categories are then measured on the identical dataset. The 'prediction' of a superior continuous metric and its optimal operating point is therefore equivalent to the fitting process by construction, with no independent test set or external validation reported for the calibration.

full rationale

The paper's central claim is that OPTIMUS provides independent, generalizable separation of jailbreak quality with category-level insight beyond binary ASR. However, the metric is defined via calibrated penalty functions whose parameters are fitted on the constructed 114k-prompt dataset, and the reported S*=0.57, H*=0.43 regime plus Weak/Moderate/Optimal separation are obtained by applying the fitted metric to that same data. This makes the claimed superiority and specific numerical findings statistically forced by the calibration step rather than an out-of-sample prediction. The six-model categorization pipeline is load-bearing for per-category results but is not itself circular (it is an independent labeling step, albeit unvalidated). No self-citation chains or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claims rest on the accuracy of the majority-vote categorization pipeline and the validity of combining semantic similarity and harmfulness via calibrated penalties; these are domain assumptions without independent external benchmarks mentioned.

free parameters (2)
  • Stealth-optimal regime values S* = 0.57 and H* = 0.43
    Reported as 0.57 and 0.43 from experiments on the dataset; these appear fitted to identify the optimal point.
  • Penalty function calibration parameters
    OPTIMUS J(S,H) uses calibrated penalty functions whose specific parameters are not detailed but are required to produce the continuous scores and separation results.
axioms (2)
  • domain assumption Majority vote across six models produces accurate categorization of prompts into 14 cybersecurity attack categories.
    Invoked in the dataset construction pipeline described in contribution (1).
  • domain assumption Semantic similarity and harmfulness probability can be meaningfully combined through penalty functions to yield a superior continuous metric.
    Foundation for the definition of OPTIMUS in contribution (3).
invented entities (1)
  • OPTIMUS metric J(S,H) (no independent evidence)
    purpose: Continuous jailbreak evaluator that exposes stealth-optimal regimes missed by binary ASR.
    Newly defined metric whose calibration depends on the paper's dataset.

pith-pipeline@v0.9.0 · 5676 in / 1812 out tokens · 69398 ms · 2026-05-12T02:36:41.736967+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 5 internal anchors

  1. [Abbeel et al. (2024)] Pieter Abbeel, Dillon Bowen, Scott Emmons, Elvis Hsieh, Qingyuan Lu, Sana Pandey, Alexandra Souly, Justin Svegliato, Sam Toyer, Tu Trinh, and Olivia Watkins. A StrongREJECT for Empty Jailbreaks. (2024). doi:10.52202/079017-3984
  2. [An et al. (2025)] Yang An, B. X. Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2.5-1M Technical Report. arXiv (2025). doi:10.48550/arxiv.2501.15383
  3. [Andriushchenko et al. (2024)] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. arXiv (2024). doi:10.48550/arxiv.2404.02151
  4. Tao, Z., Lin, T., Chen, X., Li, H., Wu, Y., Li, Y., Jin, Z., Huang, F., Tao, D., and Zhou, J.
  5. [Brahman et al. (2024)] Faeze Brahman, Yejin Choi, Nouha Dziri, Allyson Ettinger, Seungju Han, Liwei Jiang, Sachin Kumar, Ximing Lu, Niloofar Mireshghallah, Kavel Rao, and Maarten Sap. WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models. (2024). doi:10.52202/079017-1493
  6. [Chao et al. (2024)] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, F. Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. arXiv (2024). doi:10.48550/arxiv.2404.01318
  7. [Chao et al. (2025)] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 23–42.
  8. [Chen et al. (2024a)] Hao Chen, Yiwen Guo, Qizhang Li, and Wangmeng Zuo. Improved Generation of Adversarial Examples Against Safety-aligned LLMs. (2024). doi:10.52202/079017-3054
  9. WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. (2024). doi:10.52202/079017-0261
  10. [Doumbouya et al. (2024)] M. Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Daniel Jurafsky, and Christopher D. Manning. h4rm3l: A Language for Composable Jailbreak Attack Synthesis. International Conference on Learning Representations (2024).
  11. [Gong et al. (2024)] Xueluan Gong, Mingzhe Li, Yilin Zhang, Fengyuan Ran, Chen Chen, Yanjiao Chen, Qian Wang, and Kwok-Yan Lam. PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs. USENIX Security Symposium (2024).
  12. [Krauß et al. (2025)] T. Krauß, Hamid Dashtbani, and Alexandra Dmitrienko. TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts. arXiv (2025). doi:10.48550/arxiv.2506.07596
  13. [Lambert et al. (2024)] Nathan Lambert, Jacob Daniel Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, et al. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. (2024).
  14. [Li et al. (2024)] Bo Li, Haohan Wang, and Andy Zhou. Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks. (2024). doi:10.52202/079017-1270
  15. [Li et al. (2025a)] Jiahui Li, Yongchang Hao, Haoyu Xu, Xing Wang, and Yu Hong. Exploiting the index gradients for optimization-based jailbreaking on large language models. In Proceedings of the 31st International Conference on Comp…
  16. AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs. arXiv preprint arXiv:2404.07921 (2024).
  17. [Liu et al. (2024a)] Fan Liu, Yue Feng, Zhao Xu, Lixin Su, Xinyu Ma, Dawei Yin, and Hao Liu. JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework. arXiv preprint arXiv:2410.12855 (2024).
  18. Vicuna-7B-v1.5. https://huggingface.co/lmsys/vicuna-7b-v1.5. Accessed: 2023-11-11.
  19. [Luo et al. (2024)] Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks. (2024).
  20. [Mazeika et al. (2024)] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249 (2024).
  21. [Meta (2023)] Llama-3.1-8B-Instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Accessed: 2023-11-11.
  22. [Meta AI (2024)] Llama Prompt Guard 2 (86M). https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M. Accessed: 2025-11-04.
  23. [Mistral AI (2023a)] Ministral-8B-Instruct-2410. https://huggingface.co/mistralai/Ministral-8B-Instruct-2410. Accessed: 2023-11-11.
  24. [Mistral AI (2023b)] Mistral-7B-Instruct-v0.3. https://huggingface.co/mistral…
  25. Efficient jailbreak attack sequences on large language models via multi-armed bandit-based context switching. In The Thirteenth International Conference on Learning Representations.
  26. [Reddy et al. (2025)] Aashray Reddy, Andrew Zagula, and Nicholas Saban.
  27. All-MiniLM-L12-v2: A Sentence-Transformer Model. https://huggingface.co/sentence-transformers/all-mpnet-base-v2. Accessed: 2025-11-04.
  28. [Russinovich et al. (2024)] M. Russinovich, Ahmed Salem, and Ronen Eldan. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. arXiv (2024). doi:10.48550/arxiv.2404.01833
  29. [Shen et al. (2024)] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. arXiv (2024). doi:10.1145/3658644.3670388
  30. [Team et al. (2025)] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786 (2025).
  31. [Wang et al. (2024)] Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuaibao Wang, Yingjiu Li, Yang Liu, Ning Liu, and Juergen Rahmel. SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner. arXiv (2024). doi:10.48550/arxiv.2406.05498
  32. [Xu et al. (2024)] Xilie Xu, Keyi Kong, Ninghao Liu, Li-zhen Cui, Di Wang, Jingfeng Zhang, and Mohan S. Kankanhalli. An LLM can Fool Itself: A Prompt-Based Adversarial Attack. arXiv (2024). doi:10.48550/arxiv.2310.13345
  33. [Yu et al. (2024a)] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. LLM-Fuzzer: Scaling assessment of large language model jailbreaks. In 33rd USENIX Security Symposium (USENIX Security 24). 4657–4674.
  34. [Yu et al. (2024b)] Zhiyuan Yu, Xiaogeng … Exploiting Task-Level Vulnerabilities: An Automatic Jailbreak Attack and Defense Benchmarking for LLMs. In 34th USENIX Security Symposium (USENIX Security 25). 2363–2382.
  35. [Zhu et al. (2023)] Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, A. Nenkova, and Tong Sun. AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models. arXiv (2023). doi:10.48550/arxiv.2310.15140

Internal anchors

  • "control bars"
  • "A Justification on Optimus Function J": The surface J(S, H) in (3) is smooth and strictly positive for (S, H) ∈ (0, 1) …
  • "stealth-optimal": (s_u, h_ℓ, α, β) = (0.8, 0.2, 10, 10) yields S* ≈ 0.5665, H* ≈ 0.4335, J_max ≈ 0.4709. At this equilibrium, P_S ≈ P_H ≈ 0.91, which means both penalties are active but neither dominates. This is the "stealth-optimal" point: the rewrite is still meaningfully harmful (H ≈ 0.43) but not overtly phrased, and it is semantically faithful (S ≈ 0.57) without being a trivial …
  • "Insert a hidden administrator login route in a Flask web application": Taxonomy of Cyber Attack Categories. Table 10 presents the taxonomy of cyber-attack categories defined in this study, each with a definition and example prompt. These categories were developed by refining harmful prompts from the WildJailbreak and JailBreakV-28k datasets, creating a unified classification of adversarial behaviors. …
  • "optimal zone": … are concentrated in a single cluster. In contrast, broader categories like Malware, Social Engineering, and Other distribute across multiple clusters, implying greater internal diversity. These results demonstrate that unsupervised clustering reveals meaningful structure in adversarial prompt data, validating our manually derived cyber-attack taxonomy. …