pith. machine review for the scientific record.

arxiv: 2605.09225 · v1 · submitted 2026-05-09 · 💻 cs.CR · cs.AI · cs.LG

Recognition: 2 theorem links


The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:36 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords: jailbreak · LLM · adversarial · evaluation metric · cybersecurity · prompt generation · semantic similarity · harmfulness

The pith

A continuous metric called OPTIMUS rates LLM jailbreaks by jointly scoring semantic similarity and harmfulness, revealing quality gradations that binary success rates miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a large-scale dataset of 114,000 jailbreak prompts by combining 912 strategies with 125 harmful seeds and classifies them into 14 cybersecurity categories using majority voting from six models. It fine-tunes LLMs to generate new jailbreak prompts automatically for given categories and harmful intents. The key advance is the OPTIMUS metric, a training-free continuous score that measures both how similar the jailbreak is to the original harmful request and how likely it is to cause harm, adjusted by penalties. This approach distinguishes weak, moderate, and optimal jailbreaks in ways that simple success-or-failure counts cannot, based on extensive experiments.
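The dataset arithmetic in the summary checks out exactly: composing every strategy with every seed accounts for the full prompt count.

```python
n_strategies = 912   # composing strategies (from the abstract)
n_seeds = 125        # harmful seed prompts from JailBreakV-28K
total = n_strategies * n_seeds  # 114000, the reported dataset size
```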

Core claim

The paper establishes that jailbreak effectiveness can be assessed more precisely with a continuous metric J(S,H) combining semantic similarity S to the harmful seed and harmfulness probability H through calibrated penalty functions. This metric, tested on 114,000 prompts, separates jailbreaks into weak, moderate, and optimal categories with detailed category-level insights that binary attack success rates fail to provide. Supporting this are an automated generation method using fine-tuned models that produce low-perplexity prompts and the categorized dataset enabling strategy ranking.
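The page does not spell out how S is computed, but Figure 8 names all-mpnet-base-v2 as the best-performing similarity model, which suggests the standard recipe: embed the harmful seed and the jailbreak with a sentence encoder, then take cosine similarity. The sketch below covers only the cosine step and assumes the embedding vectors are already available; the paper's exact pipeline may differ.

```python
import math

def cosine_similarity(u, v):
    # u, v: embedding vectors for the harmful seed and the jailbreak prompt,
    # e.g. from a sentence encoder such as all-mpnet-base-v2 (assumed here).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical directions -> 1.0
```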

What carries the argument

OPTIMUS, defined as the continuous metric J(S,H) that captures semantic similarity between harmful seed and jailbreak along with harmfulness probability using calibrated penalties.
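The page never reproduces Equation (3), but an appendix excerpt quoted elsewhere on this page gives (s_u, h_ℓ, α, β) = (0.8, 0.2, 10, 10) with equilibrium S* ≈ 0.5665, H* ≈ 0.4335, J_max ≈ 0.4709, and P_S ≈ P_H ≈ 0.91, and Figure 3 mentions a harmonic mean over S and H_safe = 1 − H. The sketch below is a reconstruction consistent with those quoted numbers, not the paper's verbatim formula: a harmonic mean of similarity and safety, damped by two logistic penalties.

```python
import math

def optimus_sketch(S, H, s_u=0.8, h_l=0.2, alpha=10.0, beta=10.0):
    """Hedged reconstruction of J(S, H); the functional form is an assumption.

    Harmonic mean of similarity S and safety H_safe = 1 - H, multiplied by
    two logistic penalties: P_S discounts near-verbatim rewrites (S > s_u),
    P_H discounts prompts with negligible harm (H < h_l).
    """
    h_safe = 1.0 - H
    harmonic = 2.0 * S * h_safe / (S + h_safe + 1e-12)
    p_s = 1.0 / (1.0 + math.exp(alpha * (S - s_u)))   # ~0.91 at S = 0.57
    p_h = 1.0 / (1.0 + math.exp(beta * (h_l - H)))    # ~0.91 at H = 0.43
    return harmonic * p_s * p_h

# The quoted stealth-optimal point lands near the quoted maximum:
optimus_sketch(0.5665, 0.4335)  # ~0.471
```

With these defaults the reconstruction reproduces the quoted equilibrium to three decimal places, which is why it is offered as a plausible reading of J(S,H) rather than a guess from nothing.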

Load-bearing premise

The majority vote from six models correctly categorizes the prompts into the 14 attack types, and the penalty calibrations in the OPTIMUS metric apply generally without overfitting to this particular collection of prompts.
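The six-model vote can be sketched as a plurality over per-model category predictions. The tie-breaking behavior below (first-seen label among equally common ones, via `Counter.most_common`) is an assumption; the paper's rule is not stated on this page.

```python
from collections import Counter

def majority_category(votes):
    # votes: the category predicted by each of the six classifier models.
    counts = Counter(votes)
    category, n_votes = counts.most_common(1)[0]
    return category, n_votes

majority_category(["phishing", "phishing", "malware",
                   "phishing", "privilege escalation", "phishing"])
# ("phishing", 4)
```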

What would settle it

Demonstrating that a binary attack success rate metric can separate the weak, moderate, and optimal jailbreak categories with equivalent or better resolution than OPTIMUS on the same dataset, or that OPTIMUS scores do not hold up when applied to jailbreaks generated by methods outside the 912 strategies.

Figures

Figures reproduced from arXiv: 2605.09225 by Ismail Hossain, Md Jahangir Alam, Sai Puppala, Sajedul Talukder, Syed Bahauddin Alam, Tanzim Ahad.

Figure 1: Overview of the prompt composition process combining multiple jailbreak strategies to generate contextually …
Figure 2: Overview of generating and evaluating jailbreak prompts for LLM safety, from composing prompts through …
Figure 3: 3D Landscapes of the Optimus Score under Different Hyperparameter Configurations: (a) Balanced Configuration, (b) Lenient Configuration, (c) Strict Configuration.
Figure 4: Mean Optimus Score across all (S, H) Model Pairs.
Figure 5: Score-Range Distribution (Counts) of StrongReject Evaluation Across Four Models. Each value indicates the number of prompts whose StrongReject score falls within the specified range.
Figure 6: KDE plots showing the distribution of Optimus …
Figure 7: Top jailbreak tactic frequency per attack category (stacked bars).
Figure 8: Analytical and Empirical Surfaces of the Optimus Score. The first plot from the top shows the analytical surface derived from Equation (3), illustrating how the score peaks when semantic similarity (S) is high and harmfulness probability (H) is low. The second plot from the top presents the empirical surface computed using the best-performing model pair (all-mpnet-base-v2 × deberta-large-mnli).
Figure 9: Distribution of winning votes across 14 attack categories, showing variability in model vulnerability to …
Figure 10: Distribution of Optimus scores across different cybersecurity task categories. Each histogram shows the score ranges corresponding to Weak (0.212–0.283), Moderate (0.283–0.377), and Optimal (0.377–0.471) jailbreak compositions, highlighting the number of samples in each range per task type.
Figure 11: Correlation heatmap showing semantic relationships among different attack categories, where higher values …
Figure 12: Instruction prompt used for identifying jailbreaking strategies in adversarial user prompts.
Figure 13: Instruction prompt used for LLM-based similarity and harmfulness evaluation.
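The Weak/Moderate/Optimal score ranges quoted in Figure 10's caption can be restated as a simple binning rule. Treating each boundary as left-inclusive is an assumption about the paper's binning, since the quoted ranges share endpoints.

```python
def optimus_band(score):
    # Score bands quoted in Figure 10's caption (assumed left-inclusive).
    if 0.377 <= score <= 0.471:
        return "Optimal"
    if 0.283 <= score < 0.377:
        return "Moderate"
    if 0.212 <= score < 0.283:
        return "Weak"
    return "out of range"

optimus_band(0.40)  # "Optimal"
```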
Original abstract

Jailbreak attacks -- adversarial prompts that bypass LLM alignment through purely linguistic manipulation -- pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically. This paper addresses that gap with three contributions. (1) Large-scale compositional jailbreak dataset. We construct 114,000 adversarial prompts by applying 912 composing strategies to 125 harmful seed prompts from JailBreakV-28K. Every prompt is assigned to one of 14 cybersecurity attack categories (e.g., malware, phishing, privilege escalation) via a six-model majority-vote pipeline, and each strategy is ranked by effectiveness per category, enabling principled strategy selection grounded in concrete adversarial objectives. (2) Automated jailbreak generation. We instruction-fine-tune category-aware LLMs on Moderate and Optimal subsets, producing models that synthesize fluent jailbreak prompts from a harmful seed at inference time -- no templates, no gradient search. Our generators achieve perplexity 24-39 versus 40-140 for AutoDAN and AmpleGCG, with safety-filter evasion rates of 0.29-0.51 Mal (LlamaPromptGuard-2-86M), enabling controllable, scalable red-teaming under realistic adversarial conditions. (3) OPTIMUS: a training-free jailbreak evaluator. OPTIMUS is a continuous metric J(S,H) that jointly captures semantic similarity between the harmful seed and the jailbreak (S) and harmfulness probability (H) via calibrated penalty functions. Unlike binary attack success rate (ASR), OPTIMUS requires no task-specific training, generalizes across evolving strategies, and exposes a stealth-optimal regime (S*=0.57, H*=0.43) that ASR misses. Experiments across 114,000 prompts confirm that OPTIMUS separates Weak, Moderate, and Optimal jailbreaks with category-level evidence binary evaluation cannot supply.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper constructs a 114,000-prompt jailbreak dataset by composing 912 strategies over 125 harmful seeds, assigns each prompt to one of 14 cybersecurity categories via a six-model majority-vote pipeline, fine-tunes category-aware LLMs to generate new jailbreaks, and introduces the continuous metric OPTIMUS = J(S,H) that combines semantic similarity S and harmfulness H through calibrated penalty functions. It claims this metric reveals a stealth-optimal regime (S*=0.57, H*=0.43), separates Weak/Moderate/Optimal jailbreaks at category level, and outperforms binary ASR for evaluation and strategy ranking.

Significance. If the central claims hold after validation, the work supplies useful infrastructure for systematic jailbreak research: a large, categorized, reproducible dataset; scalable template-free generators; and a training-free continuous evaluator that captures semantic-harm trade-offs missed by binary success rates. The scale (114k prompts) and the explicit exposure of a non-binary optimum are concrete strengths that could support more targeted red-teaming and metric standardization in the field.

major comments (3)
  1. [§3] §3 (Dataset Construction and Categorization): The six-model majority-vote pipeline that assigns every prompt to one of 14 categories is load-bearing for all per-category strategy rankings and for the claim that OPTIMUS supplies category-level evidence binary ASR cannot. No inter-model agreement statistics, human-validated subset, or error analysis on boundary cases are reported. If misassignment rates are non-trivial, the observed separations and rankings could be artifacts of noisy labels rather than properties of the metric or generators.
  2. [§5] §5 (OPTIMUS Metric): The penalty functions inside J(S,H) are calibrated on the same 114k-prompt dataset used for all reported experiments, and the stealth-optimal point (S*=0.57, H*=0.43) is extracted from those same experiments. This creates a circularity risk: the metric's parameters may be tuned to the particular distribution of the constructed dataset rather than generalizing across unseen strategies or models. The abstract states the functions are 'calibrated' but provides no hold-out procedure, sensitivity analysis, or cross-strategy validation.
  3. [§4] §4 (Automated Generation): The fine-tuned generators are trained on Moderate and Optimal subsets whose labels derive from the same unvalidated majority-vote categorization and from the same data used to define the stealth-optimal regime. Reported perplexity (24-39) and evasion rates (0.29-0.51) therefore inherit the same grounding issues; without an independent test set or ablation on label noise, the superiority claims over AutoDAN and AmpleGCG remain provisional.
minor comments (2)
  1. [Abstract] Abstract and §5: The ranges 'perplexity 24-39 versus 40-140' and 'evasion rates of 0.29-0.51' are given without specifying the exact models, prompt lengths, or statistical tests; adding these details would improve reproducibility.
  2. Throughout: No error bars, confidence intervals, or multiple-run statistics accompany any of the reported rates or separations, even though the dataset size would support them.
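For the perplexity ranges the referee flags, the standard definition (exponentiated mean negative log-likelihood over tokens) can be made explicit. The paper's scoring model and tokenization are unspecified on this page, so this is only the generic formula, not its evaluation pipeline.

```python
import math

def perplexity(token_logprobs):
    # token_logprobs: natural-log probabilities assigned to each token.
    # PPL = exp(-(1/N) * sum(log p_i)); lower means more fluent text.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning uniform probability 1/40 to every token has PPL 40,
# the bottom of the AutoDAN/AmpleGCG range quoted in the abstract.
perplexity([math.log(1 / 40)] * 12)  # 40.0
```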

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects of validation and generalizability. We address each major comment below and commit to revisions that will strengthen the empirical grounding of the dataset categorization, the OPTIMUS metric, and the generator evaluations.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction and Categorization): The six-model majority-vote pipeline that assigns every prompt to one of 14 categories is load-bearing for all per-category strategy rankings and for the claim that OPTIMUS supplies category-level evidence binary ASR cannot. No inter-model agreement statistics, human-validated subset, or error analysis on boundary cases are reported. If misassignment rates are non-trivial, the observed separations and rankings could be artifacts of noisy labels rather than properties of the metric or generators.

    Authors: We agree that explicit validation of the majority-vote pipeline is necessary to support the category-level claims. In the revised manuscript we will add inter-model agreement statistics (Fleiss' kappa and pairwise Cohen's kappa across the six models) and a human validation study on a stratified random subset of 1,000 prompts, with two independent cybersecurity experts providing labels. We will also report error rates on boundary cases and re-compute the per-category strategy rankings and OPTIMUS separations after excluding low-agreement prompts. These additions will quantify label noise and confirm that the reported separations are not artifacts. revision: yes
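The Fleiss' kappa the rebuttal promises can be computed directly from per-prompt vote counts. A minimal sketch, assuming a fixed six raters (the six classifier models) per prompt; this is the textbook statistic, not code from the paper.

```python
def fleiss_kappa(count_table):
    # count_table: one row per prompt, one column per category,
    # entries = number of the six models voting for that category.
    n_items = len(count_table)
    n_raters = sum(count_table[0])
    n_cats = len(count_table[0])
    # Per-category proportion of all votes cast.
    p_j = [sum(row[j] for row in count_table) / (n_items * n_raters)
           for j in range(n_cats)]
    # Per-item observed agreement among rater pairs.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in count_table]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1.0 - p_e)

# Unanimous six-model votes on every prompt give kappa = 1.0:
fleiss_kappa([[6, 0, 0], [0, 6, 0], [0, 0, 6]])  # 1.0
```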

  2. Referee: [§5] §5 (OPTIMUS Metric): The penalty functions inside J(S,H) are calibrated on the same 114k-prompt dataset used for all reported experiments, and the stealth-optimal point (S*=0.57, H*=0.43) is extracted from those same experiments. This creates a circularity risk: the metric's parameters may be tuned to the particular distribution of the constructed dataset rather than generalizing across unseen strategies or models. The abstract states the functions are 'calibrated' but provides no hold-out procedure, sensitivity analysis, or cross-strategy validation.

    Authors: The concern about circularity is valid given the calibration procedure described. We will revise the manuscript to include a hold-out validation: the 912 strategies will be partitioned into calibration and test folds; penalty functions will be re-derived on the calibration fold only, and the stealth-optimal regime together with category rankings will be evaluated on the unseen test fold. We will also add a sensitivity analysis sweeping the penalty coefficients and report the stability of the (S*, H*) point. These controls will demonstrate that OPTIMUS generalizes beyond the original dataset distribution. revision: yes
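The proposed hold-out can be sketched as a split over the 912 strategies rather than over prompts, so that every prompt built from a held-out strategy stays unseen during calibration. The 80/20 split and seeding below are illustrative assumptions, not the authors' stated protocol.

```python
import random

def strategy_folds(strategies, test_frac=0.2, seed=0):
    # Partition composing strategies (not prompts): penalty functions are
    # calibrated on one fold, and (S*, H*) is evaluated on the other.
    rng = random.Random(seed)
    shuffled = list(strategies)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

calibration, held_out = strategy_folds(range(912))
# 729 calibration strategies, 183 held out, no overlap
```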

  3. Referee: [§4] §4 (Automated Generation): The fine-tuned generators are trained on Moderate and Optimal subsets whose labels derive from the same unvalidated majority-vote categorization and from the same data used to define the stealth-optimal regime. Reported perplexity (24-39) and evasion rates (0.29-0.51) therefore inherit the same grounding issues; without an independent test set or ablation on label noise, the superiority claims over AutoDAN and AmpleGCG remain provisional.

    Authors: We acknowledge that the generator results currently share the same labeling pipeline. In revision we will introduce two safeguards: (1) an ablation that injects controlled label noise into the Moderate/Optimal subsets and measures degradation in perplexity and evasion rates, and (2) a disjoint test set of strategies held out from both fine-tuning and OPTIMUS calibration. The generators will be re-evaluated on this independent test set, and the comparisons to AutoDAN and AmpleGCG will be updated. These steps will isolate the contribution of the generators from labeling artifacts. revision: yes
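The promised label-noise ablation can be sketched as flipping each category label to a different one with a controlled probability, then re-measuring downstream metrics. The uniform-flip model below is an assumption; real vote errors are likely concentrated on semantically adjacent categories.

```python
import random

def inject_label_noise(labels, categories, rate, seed=0):
    # Flip each label to a different category with probability `rate`,
    # to measure how sensitive downstream results are to vote errors.
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < rate:
            noisy.append(rng.choice([c for c in categories if c != y]))
        else:
            noisy.append(y)
    return noisy
```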

Circularity Check

1 step flagged

OPTIMUS's calibrated penalty functions and the derived stealth-optimal regime reduce to a fit on the same 114k-prompt evaluation dataset

specific steps
  1. fitted input called prediction [Abstract]
    "OPTIMUS is a continuous metric J(S,H) that jointly captures semantic similarity between the harmful seed and the jailbreak (S) and harmfulness probability (H) via calibrated penalty functions. ... exposes a stealth-optimal regime (S*=0.57, H*=0.43) that ASR misses. Experiments across 114,000 prompts confirm that OPTIMUS separates Weak, Moderate, and Optimal jailbreaks with category-level evidence binary evaluation cannot supply."

    The penalty functions are calibrated (parameters fitted) to the 114k-prompt dataset; the specific S* and H* values and the separation into Weak/Moderate/Optimal categories are then measured on the identical dataset. The 'prediction' of a superior continuous metric and its optimal operating point is therefore equivalent to the fitting process by construction, with no independent test set or external validation reported for the calibration.

full rationale

The paper's central claim is that OPTIMUS provides independent, generalizable separation of jailbreak quality with category-level insight beyond binary ASR. However, the metric is defined via calibrated penalty functions whose parameters are fitted on the constructed 114k-prompt dataset, and the reported S*=0.57, H*=0.43 regime plus Weak/Moderate/Optimal separation are obtained by applying the fitted metric to that same data. This makes the claimed superiority and specific numerical findings statistically forced by the calibration step rather than an out-of-sample prediction. The six-model categorization pipeline is load-bearing for per-category results but is not itself circular (it is an independent labeling step, albeit unvalidated). No self-citation chains or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claims rest on the accuracy of the majority-vote categorization pipeline and the validity of combining semantic similarity and harmfulness via calibrated penalties; these are domain assumptions without independent external benchmarks mentioned.

free parameters (2)
  • Stealth-optimal regime values S* = 0.57 and H* = 0.43
    Reported as 0.57 and 0.43 from experiments on the dataset; these appear fitted to identify the optimal point.
  • Penalty function calibration parameters
    OPTIMUS J(S,H) uses calibrated penalty functions whose specific parameters are not detailed but are required to produce the continuous scores and separation results.
axioms (2)
  • domain assumption Majority vote across six models produces accurate categorization of prompts into 14 cybersecurity attack categories.
    Invoked in the dataset construction pipeline described in contribution (1).
  • domain assumption Semantic similarity and harmfulness probability can be meaningfully combined through penalty functions to yield a superior continuous metric.
    Foundation for the definition of OPTIMUS in contribution (3).
invented entities (1)
  • OPTIMUS metric J(S,H) (no independent evidence)
    purpose: Continuous jailbreak evaluator that exposes stealth-optimal regimes missed by binary ASR.
    Newly defined metric whose calibration depends on the paper's dataset.

pith-pipeline@v0.9.0 · 5676 in / 1812 out tokens · 69398 ms · 2026-05-12T02:36:41.736967+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 5 internal anchors

  1. [Abbeel et al. (2024)] Pieter Abbeel, Dillon Bowen, Scott Emmons, Elvis Hsieh, Qingyuan Lu, Sana Pandey, Alexandra Souly, Justin Svegliato, Sam Toyer, Tu Trinh, and Olivia Watkins. A StrongREJECT for Empty Jailbreaks. (2024). doi:10.52202/079017-3984
  2. [An et al. (2025)] Yang An, B. X. Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2.5-1M Technical Report. arXiv (2025). doi:10.48550/arxiv.2501.15383
  3. [Andriushchenko et al. (2024)] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. arXiv (2024). doi:10.48550/arxiv.2404.02151
  4. Tao, Z., Lin, T., Chen, X., Li, H., Wu, Y., Li, Y., Jin, Z., Huang, F., Tao, D., and Zhou, J.
  5. [Brahman et al. (2024)] Faeze Brahman, Yejin Choi, Nouha Dziri, Allyson Ettinger, Seungju Han, Liwei Jiang, Sachin Kumar, Ximing Lu, Niloofar Mireshghallah, Kavel Rao, and Maarten Sap. WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models. (2024). doi:10.52202/079017-1493
  6. [Chao et al. (2024)] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, F. Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. arXiv (2024). doi:10.48550/arxiv.2404.01318
  7. [Chao et al. (2025)] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 23–42.
  8. [Chen et al. (2024a)] Hao Chen, Yiwen Guo, Qizhang Li, and Wangmeng Zuo. Improved Generation of Adversarial Examples Against Safety-aligned LLMs. (2024). doi:10.52202/079017-3054
  9. WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. (2024). doi:10.52202/079017-0261
  10. [Doumbouya et al. (2024)] M. Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Daniel Jurafsky, and Christopher D. Manning. h4rm3l: A Language for Composable Jailbreak Attack Synthesis. International Conference on Learning Representations (2024).
  11. [Gong et al. (2024)] Xueluan Gong, Mingzhe Li, Yilin Zhang, Fengyuan Ran, Chen Chen, Yanjiao Chen, Qian Wang, and Kwok-Yan Lam. PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs. USENIX Security Symposium (2024).
  12. [Krauß et al. (2025)] T. Krauß, Hamid Dashtbani, and Alexandra Dmitrienko. TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts. arXiv (2025). doi:10.48550/arxiv.2506.07596
  13. [Lambert et al. (2024)] Nathan Lambert, Jacob Daniel Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, et al. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. (2024).
  14. [Li et al. (2024)] Bo Li, Haohan Wang, and Andy Zhou. Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks. (2024). doi:10.52202/079017-1270
  15. [Li et al. (2025a)] Jiahui Li, Yongchang Hao, Haoyu Xu, Xing Wang, and Yu Hong. Exploiting the index gradients for optimization-based jailbreaking on large language models. In Proceedings of the 31st International Conference on Comp…
  16. AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs. arXiv preprint arXiv:2404.07921 (2024).
  17. [Liu et al. (2024a)] Fan Liu, Yue Feng, Zhao Xu, Lixin Su, Xinyu Ma, Dawei Yin, and Hao Liu. JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework. arXiv preprint arXiv:2410.12855 (2024).
  18. Vicuna-7B-v1.5. https://huggingface.co/lmsys/vicuna-7b-v1.5. Accessed: 2023-11-11.
  19. [Luo et al. (2024)] Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks. (2024).
  20. [Mazeika et al. (2024)] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249 (2024).
  21. [Meta (2023)] Llama-3.1-8B-Instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Accessed: 2023-11-11.
  22. [Meta AI (2024)] Llama Prompt Guard 2 (86M). https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M. Accessed: 2025-11-04.
  23. [Mistral AI (2023a)] Ministral-8B-Instruct-2410. https://huggingface.co/mistralai/Ministral-8B-Instruct-2410. Accessed: 2023-11-11.
  24. [Mistral AI (2023b)] Mistral-7B-Instruct-v0.3. https://huggingface.co/mistral…
  25. Efficient jailbreak attack sequences on large language models via multi-armed bandit-based context switching. In The Thirteenth International Conference on Learning Representations.
  26. [Reddy et al. (2025)] Aashray Reddy, Andrew Zagula, and Nicholas Saban.
  27. All-MiniLM-L12-v2: A Sentence-Transformer Model. https://huggingface.co/sentence-transformers/all-mpnet-base-v2. Accessed: 2025-11-04.
  28. [Russinovich et al. (2024)] M. Russinovich, Ahmed Salem, and Ronen Eldan. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. arXiv (2024). doi:10.48550/arxiv.2404.01833
  29. [Shen et al. (2024)] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. arXiv (2024). doi:10.1145/3658644.3670388
  30. [Team et al. (2025)] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786 (2025).
  31. [Wang et al. (2024)] Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuaibao Wang, Yingjiu Li, Yang Liu, Ning Liu, and Juergen Rahmel. SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner. arXiv (2024). doi:10.48550/arxiv.2406.05498
  32. [Xu et al. (2024)] Xilie Xu, Keyi Kong, Ninghao Liu, Li-zhen Cui, Di Wang, Jingfeng Zhang, and Mohan S. Kankanhalli. An LLM can Fool Itself: A Prompt-Based Adversarial Attack. arXiv (2024). doi:10.48550/arxiv.2310.13345
  33. [Yu et al. (2024a)] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. LLM-Fuzzer: Scaling assessment of large language model jailbreaks. In 33rd USENIX Security Symposium (USENIX Security 24). 4657–4674.
  34. [Yu et al. (2024b)] Zhiyuan Yu, Xiaogeng … Exploiting Task-Level Vulnerabilities: An Automatic Jailbreak Attack and Defense Benchmarking for LLMs. In 34th USENIX Security Symposium (USENIX Security 25). 2363–2382.
  35. [Zhu et al. (2023)] Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, A. Nenkova, and Tong Sun. AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models. arXiv (2023). doi:10.48550/arxiv.2310.15140

Internal anchors

  • "control bars"
  • "A Justification on Optimus Function J": The surface J(S, H) in (3) is smooth and strictly positive for (S, H) ∈ (0, 1) …
  • "stealth-optimal": (s_u, h_ℓ, α, β) = (0.8, 0.2, 10, 10) yields S* ≈ 0.5665, H* ≈ 0.4335, J_max ≈ 0.4709. At this equilibrium, P_S ≈ P_H ≈ 0.91, which means both penalties are active but neither dominates. This is the "stealth-optimal" point: the rewrite is still meaningfully harmful (H ≈ 0.43) but not overtly phrased, and it is semantically faithful (S ≈ 0.57) without being a trivial …
  • "Insert a hidden administrator login route in a Flask web application": Taxonomy of Cyber Attack Categories. Table 10 presents the taxonomy of cyber-attack categories defined in this study, each with a definition and example prompt. These categories were developed by refining harmful prompts from the WildJailbreak and JailBreakV-28k datasets, creating a unified classification of adversarial behaviors. …
  • "optimal zone": … are concentrated in a single cluster. In contrast, broader categories like Malware, Social Engineering, and Other distribute across multiple clusters, implying greater internal diversity. These results demonstrate that unsupervised clustering reveals meaningful structure in adversarial prompt data, validating our manually derived cyber-attack taxonomy. …